DUDES: Deep Uncertainty Distillation using Ensembles for Semantic Segmentation

arXiv:2303.09843v1 [cs.CV] 17 Mar 2023

Deep neural networks lack interpretability and tend to be overconfident, which poses a serious problem in safety-critical applications like autonomous driving, medical imaging, or machine vision tasks with high demands on reliability. Quantifying the predictive uncertainty is a promising endeavour to open up the use of deep neural networks for such applications. Unfortunately, currently available methods are computationally expensive. In this work, we present a novel approach for efficient and reliable uncertainty estimation, which we call Deep Uncertainty Distillation using Ensembles for Segmentation (DUDES). DUDES applies student-teacher distillation with a Deep Ensemble to accurately approximate predictive uncertainties with a single forward pass while maintaining simplicity and adaptability. Experimentally, DUDES accurately captures predictive uncertainties without sacrificing performance on the segmentation task and shows impressive capabilities in identifying wrongly classified pixels and out-of-domain samples on the Cityscapes dataset. With DUDES, we manage to simultaneously simplify and outperform previous work on Deep Ensemble-based Uncertainty Distillation.


Introduction
Semantic segmentation is a computer vision task that aims to assign a class label to each pixel in a given image, with the goal of understanding the semantic content of the image. Hence, it can be viewed as a pixel-wise classification task. Recently, methods based on deep neural networks have become the most popular and successful approach to solve this problem [29]. Despite their unrivaled performance on established benchmark datasets like Cityscapes [5] or PASCAL VOC [7], neural networks lack interpretability [11], are unable to distinguish between in-domain and out-of-domain samples [22], and tend to be overconfident [12]. These shortcomings are especially severe for safety-critical applications like autonomous driving [27] and the analysis of medical imaging [23], or for computer vision tasks that have high demands on reliability like industrial inspection [37] and automation [38].
Quantifying the predictive uncertainty is a promising endeavour to make such applications safer and more reliable, e.g., by preemptively making risk-averse predictions or by providing feedback to a human operator when predictions are uncertain. Some of the most relevant methods include Bayesian Neural Networks [25], Monte Carlo Dropout [9], and Deep Ensembles (DE) [21]. Unfortunately, all of these methods require computationally expensive estimation of a distribution of outputs by sampling from a stochastic process. Recently, the concept of Knowledge Distillation (KD) has been introduced as a potential solution [1,16,34,35]. Knowledge distillation is a technique for transferring the knowledge embodied in a complex model, referred to as the teacher, to a smaller model, referred to as the student. By incorporating the knowledge learned by a more complex model, the student's performance can be enhanced [15,26,33].
In this work, we present a novel approach for efficient and reliable uncertainty quantification, which we call 'Deep Uncertainty Distillation using Ensembles for Segmentation' (DUDES), as shown in Figure 1. DUDES applies student-teacher distillation with a DE to accurately approximate predictive uncertainties while maintaining simplicity and adaptability. In comparison to the DE, only a single forward pass is required to obtain predictive uncertainties, which massively reduces the inference time and removes the computational overhead of evaluating every ensemble member. DUDES simultaneously simplifies and outperforms previous work on DE-based uncertainty distillation, which we experimentally evaluate on the Cityscapes dataset. Additionally, it is worth noting that the DE can in principle be substituted by any other Uncertainty Quantification (UQ) method.
After a brief overview of the related work on UQ and KD in Section 2, the methodology of DUDES is described in Section 3. In Section 4, we demonstrate the ability of DUDES through quantitative and qualitative analysis of the predictive uncertainties, and the potential to identify wrongly classified pixels and out-of-domain samples on the Cityscapes dataset. Section 5 discusses the experimental results and their potential impact on future research, while Section 6 concludes the paper.

Related Work
In this section, we briefly summarize the work that is foundational to DUDES. The two main methodological components of our approach are UQ (cf. Section 2.1) and KD (cf. Section 2.2).

Uncertainty Quantification
Deep neural networks consist of a large number of model parameters and often include non-linearities, which generally makes the exact posterior probability distribution of a network's output prediction intractable [3,24]. This leads to approximate UQ approaches including softmax probability, Bayesian techniques like Bayesian Neural Networks (BNNs) [25] and Monte Carlo Dropout [9], as well as DEs [21].
While the softmax predictions are easy to implement, the predicted probabilities tend to be overconfident and need to be calibrated [12]. Additionally, they are often erroneously interpreted as model confidence [9]. A mathematically sound approach based on Bayesian inference is provided by BNNs, where a deterministic network is transformed into a stochastic one. This is done by placing probability distributions over either the activations or the weights [19]. For example, Bayes by Backprop [3] uses variational inference to learn approximate distributions over the weights. At test time, weights are sampled from the learned distributions, resulting in an ensemble of models that is used to sample from the posterior distribution over the predictions. To overcome the high computational cost of BNNs, Gal and Ghahramani [9] propose Monte Carlo Dropout as an approximation of a stochastic Gaussian process using a common regularization method. While dropout regularization [36] is usually only used during training, Monte Carlo Dropout applies this technique to sample from the posterior distribution of the predictions at test time. Since Monte Carlo Dropout only captures the uncertainty inherent to the model, Kendall and Gal [20] and Loquercio et al. [24] combine the method with learned uncertainty predictions and assumed density filtering [10], respectively, to obtain the total uncertainty in the predictions.
The uncertainties produced by Monte Carlo Dropout are not calibrated [9], which is a major drawback that is overcome by DEs [21], where an ensemble of trained models produces samples of predictions at test time. Due to the randomness introduced by random weight initialization or different data augmentations across the ensemble members [8], DEs are well-calibrated [21] and outperform UQ approaches like softmax probability, Monte Carlo Dropout, and Bayes by Backprop, as shown by Ovadia et al. [30]. The latter also show that DEs seem to be more robust against dataset shift, which was also observed by Wursthorn et al. [39].

Knowledge Distillation
KD is a technique for transferring the knowledge embodied in a complex model, referred to as the teacher, to a smaller model, referred to as the student. The teacher can be a model with more parameters or even a DE. The student is trained to imitate the predictions of the teacher on a given dataset, with the goal of minimizing the difference between the student's outputs and the teacher's outputs. By incorporating the knowledge learned by a more complex model, the student's performance can be enhanced. Usually, this results in a more compact student model that achieves similar performance compared to the teacher model. The most formative works on KD were published by Hinton et al. [15], Romero et al. [33], and Malinin et al. [26].
Recently, the concept of KD has attracted increasing interest in the context of efficient UQ to enable real-time uncertainty estimation [1,16,34,35]. For instance, Shen et al. [34] have used student-teacher distillation for real-time UQ based on Monte Carlo Dropout [36]. Even more related to our method is the work published by Holder and Shafique [16] on DE-based student-teacher distillation for efficient UQ and out-of-domain detection. Their approach requires custom segmentation and uncertainty head architectures, two additional losses, and three new hyperparameters for the distillation process. This makes their proposed method difficult to implement and to adapt to new applications. Aside from that, their student struggles to compete with the teacher's segmentation performance and does not accurately approximate the class-wise uncertainties in some cases. In contrast, DUDES does not need custom segmentation and uncertainty head architectures, only introduces a single uncertainty loss without hyperparameters, outperforms the teacher's segmentation results, and accurately captures the teacher's uncertainties.
Thereby, DUDES significantly improves on all of the shortcomings of the method proposed by Holder and Shafique [16].

Methodology
In the following, we provide an overview of DUDES, explain the methodology behind our uncertainty distillation approach, and lay out the implementation details. Our goal is to present an easy-to-adapt framework for efficient and reliable uncertainty estimation that utilizes student-teacher distillation.

Overview
DUDES is a framework for efficient UQ through student-teacher distillation. The overall goal is to train a student model that can simultaneously output a segmentation prediction and a corresponding predictive uncertainty, as shown in Figure 1. We propose a two-step framework:
1. Training the teacher with the ground truth labels.
2. Training the student with the ground truth labels and the teacher's uncertainty predictions.
As shown in Figure 2, the training of the student model consists of two loss components. The first component measures the distance between the student's segmentation prediction and the ground truth labels, while the second component measures the distance between the student's uncertainty prediction and the output of one of the UQ methods described in Section 2.1. As mentioned before, we choose a DE as the teacher for the concrete implementation of DUDES. DEs are simple to implement, easily parallelizable, and require little tuning. They are well-calibrated, more robust against dataset shift, and outperform other UQ methods like Monte Carlo Dropout or Bayesian Neural Networks [8,21,30,39]. However, it is worth noting that the DE can in principle be substituted by any other UQ method.
Teacher. To provide meaningful uncertainties, we use a DE as the teacher, which consists of ten baseline models that are not pre-trained, thus following prior work on DE-based uncertainty quantification [8,21]. By randomly initializing all the parameters before training, we aim to capture different aspects of the input data distribution for each ensemble member, boosting the teacher's overall performance, robustness, and uncertainty quantification capabilities. During inference, each ensemble member produces slightly different predictions, enabling the calculation of a mean segmentation prediction and an uncertainty prediction.
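The teacher's aggregation step can be illustrated for a single pixel with plain Python. This is only a sketch: the function name `teacher_outputs` and the toy probabilities are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator is used.

```python
import statistics

def teacher_outputs(member_probs):
    """Aggregate the softmax vectors of all ensemble members for one pixel.

    member_probs: list of M lists, each holding C class probabilities.
    Returns (mean_probs, predicted_class, uncertainty).
    """
    num_classes = len(member_probs[0])
    # Mean softmax probability per class across the ensemble members.
    mean_probs = [
        statistics.fmean(p[c] for p in member_probs) for c in range(num_classes)
    ]
    # The segmentation prediction is the class with the highest mean probability.
    predicted_class = max(range(num_classes), key=lambda c: mean_probs[c])
    # Predictive uncertainty: spread of the members' probabilities for that class.
    # Using the population standard deviation is an assumption of this sketch.
    uncertainty = statistics.pstdev(p[predicted_class] for p in member_probs)
    return mean_probs, predicted_class, uncertainty

# Three hypothetical members that agree on class 0 with varying confidence.
members = [[0.9, 0.1], [0.7, 0.3], [0.8, 0.2]]
mean_probs, predicted_class, uncertainty = teacher_outputs(members)
```

Applied to every pixel, the third return value yields the single-channel uncertainty target that the student is trained to reproduce.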
Student. As our student not only has to output a segmentation prediction, but also a corresponding predictive uncertainty, we add a second head to the baseline model's decoder. We propose to use an additional uncertainty head that is identical to the regular segmentation head of the baseline model, except for the output layer. For the segmentation head, we use a softmax activation to obtain class-wise probability distributions. For the uncertainty head, in contrast, we use a sigmoid activation that bounds outputs between 0 and 1. Our uncertainty head only needs one output channel instead of one channel per class, as needed by the segmentation head. Since this is a key modification to improve upon previous work by Holder and Shafique [16], we will discuss this simplification in detail in Section 5. In contrast to the randomly initialized ensemble members, the student's parameters are initialized with ImageNet pre-training [6] to improve convergence speed.
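The difference between the two output activations can be seen in a small plain-Python sketch for one pixel. The logit values and the three-class setup are hypothetical and serve only to show that the segmentation head yields one probability per class, while the uncertainty head yields a single bounded value.

```python
import math

def softmax(logits):
    """Segmentation head activation: class probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logit):
    """Uncertainty head activation: a single value bounded between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical raw outputs of the two heads for one pixel.
class_probs = softmax([2.0, 0.5, -1.0])  # C = 3 channels in this toy example
pixel_uncertainty = sigmoid(-2.0)        # one channel, independent of C
```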

Uncertainty Distillation
To efficiently estimate the predictive uncertainty of the DE with a single student model, we utilize student-teacher distillation to train our student to behave like the teacher with a combination of two losses.
Segmentation Loss. The main objective function that is being minimized for the segmentation task is the well-known categorical cross-entropy loss:

$$L_S = -\sum_{i=1}^{C} y_i \log(p_i), \tag{1}$$

where $L_S$ is the segmentation loss for a single image, $C$ is the number of classes, $y_i$ is the ground truth label for the i-th class, and $p_i$ is the predicted probability for the i-th class.
The categorical cross-entropy loss measures the dissimilarity between the ground truth probability distribution and the predicted probability distribution. By minimizing this loss during training, the model is encouraged to produce pixel-wise class predictions that are as close as possible to the ground truth classes.
Uncertainty Loss. For distilling the predictive uncertainties of our teacher into the student, we introduce an additional uncertainty loss, which is formulated as the root mean squared logarithmic error (RMSLE):

$$L_U = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\log(z_i + 1) - \log(q_i + 1)\right)^2}, \tag{2}$$

where $L_U$ is the uncertainty loss for a single image, $N$ is the number of pixels in the image, $z_i$ is the teacher's predictive uncertainty for the i-th pixel as ground truth, and $q_i$ is the corresponding student's uncertainty prediction. The teacher's predictive uncertainty $z_i$ represents the standard deviation of the softmax probabilities of the predicted class in the segmentation map. By minimizing the RMSLE during training, the student is encouraged to produce uncertainty estimates that are as close as possible to the teacher's uncertainties. Since most of the uncertainties are close to zero, the natural logarithm provides special attention to the pixels where uncertainties are higher.
Total Loss. The total loss for training our student model is the sum of the segmentation loss described in Equation 1 and the uncertainty loss expressed in Equation 2:

$$L = L_S + L_U. \tag{3}$$
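The two loss terms and their sum can be sketched in plain Python. This is an illustration rather than the training code: the function names are hypothetical, the RMSLE uses the standard log(1 + x) form, the small `eps` guard against log(0) is an addition of this sketch, and the cross-entropy is evaluated for one pixel (averaging over pixels is assumed to happen elsewhere).

```python
import math

def segmentation_loss(y, p, eps=1e-12):
    """Categorical cross-entropy for one pixel.

    y: one-hot ground truth over C classes, p: predicted class probabilities.
    eps guards against log(0) and is an addition of this sketch.
    """
    return -sum(y_i * math.log(p_i + eps) for y_i, p_i in zip(y, p))

def uncertainty_loss(z, q):
    """RMSLE between teacher uncertainties z and student predictions q."""
    squared = [(math.log1p(z_i) - math.log1p(q_i)) ** 2 for z_i, q_i in zip(z, q)]
    return math.sqrt(sum(squared) / len(squared))

def total_loss(y, p, z, q):
    """Unweighted sum of both terms, so no loss weight has to be tuned."""
    return segmentation_loss(y, p) + uncertainty_loss(z, q)

# Toy values: one pixel for the segmentation term, four pixels for the uncertainty term.
ce = segmentation_loss([0, 1, 0], [0.2, 0.7, 0.1])
rmsle = uncertainty_loss([0.01, 0.02, 0.30, 0.05], [0.02, 0.02, 0.25, 0.04])
loss = total_loss([0, 1, 0], [0.2, 0.7, 0.1],
                  [0.01, 0.02, 0.30, 0.05], [0.02, 0.02, 0.25, 0.04])
```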

Implementation Details
Baseline. For our baseline model, we use DeepLabv3+ [4] as the decoder and a ResNet-18 [14] as the backbone. Both architectures are adapted from Iakubovskii [18] with PyTorch [31]. All the baseline models inside the teacher are trained with just the segmentation loss described in Equation 1.
Data Augmentation. To prevent overfitting, we apply the following data augmentation strategy to all training procedures:
1. Random scaling with a scaling factor between 0.5 and 2.0.
2. Random cropping with a crop size of 768 × 768.
3. Random horizontal flipping with a flip chance of 50%.
Training. For all training processes, we employ a Stochastic Gradient Descent (SGD) optimizer based on Robbins and Monro [32] with an initial learning rate of 0.01, momentum of 0.9, and weight decay of 0.0005 as optimizer-specific hyperparameters. In all experiments, the decoder's learning rate is ten times that of the backbone. Additionally, we use polynomial learning rate scheduling, which decays the current learning rate $lr$ from the initial learning rate $lr_{initial}$ over the course of training. For training the ensemble members, we use the well-known categorical cross-entropy loss described in Equation 1 as the main objective function. In all of the training processes, we train for 200 epochs with a batch size of 16 using mixed precision [28] on an NVIDIA A100 GPU with 40 GB of memory.
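The learning rate schedule can be sketched as follows, assuming the widely used polynomial form lr = lr_initial · (1 − iteration / max_iterations)^power. The exponent is not stated in the text; `power=0.9` is a common default and purely an assumption here, as are the function name `poly_lr`, the iteration counts, and the choice to treat 0.01 as the backbone's rate.

```python
def poly_lr(lr_initial, iteration, max_iterations, power=0.9):
    """Polynomially decayed learning rate; `power` is an assumed exponent."""
    return lr_initial * (1.0 - iteration / max_iterations) ** power

# Hypothetical schedule with 1000 iterations.
max_iterations = 1000
schedule = [poly_lr(0.01, it, max_iterations) for it in (0, 250, 500, 750, 1000)]

# The decoder's rate is ten times that of the backbone; assigning 0.01 to the
# backbone is an assumption of this sketch.
backbone_lr = poly_lr(0.01, iteration=500, max_iterations=max_iterations)
decoder_lr = 10.0 * backbone_lr
```

The schedule starts at the initial learning rate and reaches zero at the final iteration.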

Experiments
In this section, we present a variety of experiments conducted on the Cityscapes dataset to demonstrate the value of DUDES. Firstly, we compare the inference time and class-wise uncertainties between the teacher and the student model. Secondly, we evaluate the student's predictions qualitatively. Thirdly, we provide ablation studies.

Dataset
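All of our experiments are based on the Cityscapes dataset [5], a freely available urban street scene dataset. It consists of 2975 training images, 500 validation images, and 1525 test images. Since the ground truth labels of the test images are not publicly available, we use the validation images for testing in all of our experiments. Each RGB image is 2048 × 1024 in size, with each pixel assigned to one of 19 class labels or a void label. The void ground truth pixels are excluded during training and evaluation in the segmentation task, but they are used to evaluate the uncertainty outputs as they indicate the model's ability to distinguish between in-domain and out-of-domain samples.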

Quantitative Evaluation
Table 1 and Table 2 outline a quantitative comparison between the student's and the teacher's Intersection over Union (IoU) as well as their predictive uncertainties. The results of Holder and Shafique [16] have been included as they are the most relevant previous work on DE-based student-teacher distillation for efficient UQ. Their teacher is based on 25 DeepLabv3+ models with a MobileNet backbone [17]. The MobileNet backbone and our ResNet-18 backbone have been shown to have very similar performance [2].
Segmentation Prediction. As shown in Table 1, our student network outperforms the teacher on the segmentation task for all classes except wall and sky, with an average improvement of 2.5% in mIoU. We attribute this improvement to the student's ImageNet pre-training as compared to the randomly initialized ensemble members. The student by Holder and Shafique [16] outperforms its teacher on the segmentation task for only one class and shows a mIoU deterioration of 4.2%.
Uncertainty Prediction. Table 2 shows that our student's approximation of the teacher's uncertainties is very accurate: in 10 out of the 19 classes, our student's class-wise uncertainties deviate by less than 0.01 compared to the teacher's. Our student manages to deviate by less than 0.03 in 17 out of the 19 classes, with a maximum deviation of 0.042 for the bicycle class. On the other hand, the student by Holder and Shafique [16] deviates by less than 0.01 in 5 out of the 19 classes and by less than 0.03 in only 13 out of the 19 classes. Their student's maximum difference is 0.130 for the train class. On average across all classes, both students' uncertainties deviate only slightly from those of the teachers, with our student model deviating by 0.002 and the student by Holder and Shafique [16] deviating by -0.007. Generally speaking, both students struggle with accurately approximating the teacher's uncertainties for the last five classes: truck, bus, train, motorbike, and bicycle. For these classes, our student has an average absolute deviation of 0.028, while the student by Holder and Shafique [16] deviates by 0.066.
Figure 4 displays another comparison between the student's and the teacher's ability to approximate reliable uncertainties: For this analysis, we progressively ignored an increasing percentage of pixels in the segmentation prediction and simultaneously re-evaluated the mIoU. For this purpose, the pixels were sorted based on their predictive uncertainty in descending order. This initially removes the pixels with the most uncertain segmentation predictions from the evaluation until only the pixels with the most certain predictions are left. Consequently, meaningful uncertainties should result in a monotonically increasing function.
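The evaluation protocol described above can be sketched in plain Python on flattened pixel lists. The helper names `miou` and `miou_after_removal` and the toy data are hypothetical, void pixels are not handled, and classes that no longer occur after the removal are simply skipped, which is an assumption of this sketch.

```python
def miou(labels, preds, num_classes):
    """Mean IoU over the classes that occur in the labels or predictions."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for l, p in zip(labels, preds) if l == c and p == c)
        union = sum(1 for l, p in zip(labels, preds) if l == c or p == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")

def miou_after_removal(labels, preds, uncertainties, fraction, num_classes):
    """mIoU after dropping the given fraction of the most uncertain pixels."""
    # Sort pixel indices by uncertainty in descending order.
    order = sorted(range(len(labels)), key=lambda i: uncertainties[i], reverse=True)
    keep = order[int(round(fraction * len(order))):]
    return miou([labels[i] for i in keep], [preds[i] for i in keep], num_classes)

# Toy example: the two misclassified pixels carry the highest uncertainties.
labels        = [0, 0, 1, 1, 0, 1, 0, 1]
preds         = [0, 0, 1, 1, 1, 0, 0, 1]
uncertainties = [0.01, 0.02, 0.03, 0.02, 0.30, 0.25, 0.05, 0.04]
curve = [miou_after_removal(labels, preds, uncertainties, f, 2)
         for f in (0.0, 0.25, 0.5)]
```

In this toy case the mIoU rises from 0.6 to 1.0 once the two most uncertain pixels are removed, which is the monotonically increasing behaviour expected of meaningful uncertainties.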
As Figure 4 shows, the student as well as the teacher experience an almost linear rise in mIoU from 73.8% and 71.3%, respectively, to almost 100% after removing 90% of the most uncertain pixels. Both models attain a similar relative increase in mIoU by disregarding the first 10% of the most uncertain pixels. After ignoring 70% of the pixels, the teacher reaches a mIoU of 92.5%, while the student only attains 89.6%. Beyond this point, the student's mIoU surpasses the teacher's, with the student achieving 99.2% after ignoring 80% of the pixels with the highest uncertainties, while the teacher only reaches 95.9%. This analysis yields two key findings: Firstly, predictive uncertainties prove to be an effective means of identifying misclassified pixels. Secondly, our student's predictive uncertainties deviate only slightly from the teacher's uncertainties, revealing that they are equally meaningful.
Inference Time. Table 3 compares the inference time for a single image and the number of trainable parameters between the baseline, the teacher, and the student model. The experiment was conducted on an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. What stands out the most is that there is only an insignificant difference of 0.2 milliseconds in inference time between the baseline and the student, despite the student's ability to output an additional predictive uncertainty. With 18.5 milliseconds per image, the student is roughly 11.7 times faster than the teacher, which needs 217.1 milliseconds. Table 3 also lists the number of trainable parameters, highlighting the efficiency of the student network. The additional uncertainty head of the student network only adds 257 parameters to the baseline model.
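The reported speed-up and the parameter overhead can be checked with a few lines of arithmetic. The timings are taken from the text; the explanation of the 257 parameters is an assumption: it is consistent with a 1 × 1 convolution that maps 256 decoder channels to a single output channel plus one bias term, but the text does not state the layer configuration.

```python
# Inference times per image in milliseconds, as reported in the text.
teacher_ms = 217.1
student_ms = 18.5
speedup = teacher_ms / student_ms  # roughly 11.7

# Assumed uncertainty head: 1 x 1 convolution from 256 channels to 1 channel.
in_channels = 256   # assumption, not stated in the text
out_channels = 1    # single uncertainty channel
kernel_size = 1     # assumption, not stated in the text
weights = in_channels * out_channels * kernel_size * kernel_size
biases = out_channels
extra_params = weights + biases
```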

Qualitative Evaluation
Figure 3 displays four example images from the Cityscapes validation set and their corresponding ground truth labels, our student's segmentation prediction, a binary accuracy map, and the student's uncertainty prediction. The binary accuracy map visualizes incorrectly predicted pixels and void classes in white and correctly predicted pixels in black.
Visually, for large areas and well-represented classes like road, sidewalk, building, sky, and car, the student's performance on the segmentation task is almost free of errors. This supports the quantitative evaluation described in Table 2. Like most segmentation models, our student struggles with class transitions, areas with lots of inherent noise, or areas that belong to the void class, which is visualized by the binary accuracy map.
A comparison of the binary accuracy map and our student's uncertainty prediction adds to the observations laid out in Table 2 and Figure 4: The uncertainty prediction reliably predicts high uncertainties for wrongly classified pixels and out-of-domain samples, which are visualized as white pixels in the binary accuracy map. For example, in the first image of Figure 3, our student correctly predicts high uncertainties in the noisy parts of the background and for fine geometric structures like traffic lights. Conversely, the student predicts very low uncertainties for the road, buildings, sky, and vegetation. The second image example confirms this observation and adds two valuable insights about the quality of the student's uncertainty predictions. Firstly, although the train in the left part of the image is predicted correctly for the most part, the student still predicts high uncertainties. This is intuitively comprehensible and desired because the train class is underrepresented in the dataset and therefore potentially more difficult to detect reliably. Secondly, the student predicts high uncertainties in the bottom part of the image where reflections on the hood of the car cause incoherent segmentation predictions. The third image exemplifies another quality of the student's predictive uncertainty. In this case, the student struggles to correctly segment the truck in the right part of the image. Simultaneously, the student predicts high uncertainties for the entire truck, thus highlighting the flawed segmentation prediction. The fourth image demonstrates the student's capability to correctly identify out-of-domain samples, with high uncertainties predicted in areas belonging to the void class.

Ablation Studies
Number of Ensemble Members. An essential part of DUDES is the quality of the teacher's uncertainty prediction. Figure 5 shows the impact of the number of ensemble members on the mIoU and the mean uncertainty (mUnc).
Table 4 shows the results of another ablation study on the impact of ImageNet [6] pre-training on the mIoU and mUnc. We comprehensively compare the baseline segmentation model with our student model and our teacher, which consists of ten randomly initialized baseline models. The study does not examine the impact of ImageNet pre-training on the ensemble members as this would lead to less reliable uncertainties compared to random initialization [8,21].
When trained for 200 epochs with random initialization, our student reaches a mIoU of 64.6% on the segmentation task and thereby underperforms the baseline model by 3.9% and the teacher by 6.7%. Our randomly initialized student also underestimates the teacher's uncertainties by 0.009 with a mUnc of 0.097. When using ImageNet pre-training for the baseline model and our student, both significantly improve their mIoU to 73.7% and 73.8%, respectively. The student also manages to approximate the predictive uncertainties better with a mUnc of 0.108, which is close to the 0.106 of the teacher. It is worth noting that similar performance can also be achieved by randomly initializing our student when the number of training epochs is quadrupled to 800. This concurs with the findings of He et al. [13]. As a consequence, we suggest using ImageNet pre-training for the student to improve convergence speed.

Discussion
DUDES applies student-teacher distillation with a DE to accurately approximate predictive uncertainties with a single forward pass while maintaining simplicity and adaptability. Compared to the teacher, the required inference time per image is reduced by an order of magnitude, and the computational overhead in comparison to the baseline is negligible. Additionally, the student showed impressive capabilities in identifying wrongly classified pixels and out-of-domain samples within an image, which is crucial for safety-critical applications such as autonomous driving and many other computer vision tasks.
In contrast to the work by Holder and Shafique [16], DUDES requires no major changes to the student's architecture and introduces only a single uncertainty loss without additional hyperparameters, yet delivers significant improvements over their work. Firstly, our student model outperforms its teacher in the segmentation task, while their student suffers from a segmentation performance degradation in comparison to its teacher. Secondly, our student approximates its teacher's predictive uncertainties more closely than the student model by Holder and Shafique [16]. More precisely, their student tends to underestimate uncertainties for classes with high uncertainties and vice versa, whereas our student does not suffer from any systematic shortcomings.
A major factor in the effectiveness of DUDES lies in the simplification of what is distilled. Instead of distilling the teacher's uncertainty map, which is what Holder and Shafique [16] proposed, we only use the teacher's predictive uncertainty. The teacher's uncertainty map is calculated by computing the standard deviation of the ensemble members' softmax probability maps separately for each class. In the case of multiclass semantic segmentation, the resulting uncertainty map has dimensions of C × H × W, where C is the number of classes, H is the image height, and W is the image width. For DUDES, the class dimension is reduced to 1 by only considering the uncertainty of the predicted class in the segmentation map. Due to this simplification, the student's segmentation performance is not hindered and the predictive uncertainties can be learned more accurately.
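The reduction of the class dimension can be written compactly. The following is a sketch under the assumption that the standard deviation is taken over the $M$ ensemble members for every class and pixel and that the predicted class is the one with the highest mean probability; the symbols $M$, $p^{(m)}$, $\bar{p}$, $\sigma$, and $\hat{c}$ are introduced here for illustration and do not appear in the original text.

```latex
\begin{align*}
\bar{p}_{c,h,w} &= \frac{1}{M}\sum_{m=1}^{M} p^{(m)}_{c,h,w}
  && \text{mean softmax probability of class } c \text{ at pixel } (h,w) \\
\sigma_{c,h,w} &= \sqrt{\frac{1}{M}\sum_{m=1}^{M}\left(p^{(m)}_{c,h,w} - \bar{p}_{c,h,w}\right)^{2}}
  && \text{uncertainty map of size } C \times H \times W \\
\hat{c}_{h,w} &= \arg\max_{c}\, \bar{p}_{c,h,w}
  && \text{predicted class in the segmentation map} \\
z_{h,w} &= \sigma_{\hat{c}_{h,w},\,h,w}
  && \text{distilled uncertainty of size } 1 \times H \times W
\end{align*}
```

Only $z$ is used as the target of the uncertainty loss, so the student's uncertainty head needs a single output channel regardless of the number of classes.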
We acknowledge the simplification in the uncertainty distillation to be a potential limitation of DUDES as the student is only capable of estimating the uncertainty of the predicted class. However, there are practically no negative implications of this limitation since the remaining uncertainties are usually discarded anyway. Hence, DUDES remains useful for efficiently estimating predictive uncertainties for a wide range of applications while being easy to adapt.
We believe that DUDES has the potential to provide a new promising paradigm in reliable uncertainty quantification by focusing on simplicity and efficiency. Except for the computational overhead during training, we found no apparent reason to not employ our proposed method in semantic segmentation applications where safety and reliability are critical.

Conclusion
In this work, we propose DUDES, an efficient and reliable uncertainty quantification method that applies student-teacher distillation and maintains simplicity and adaptability throughout the entire framework. We quantitatively demonstrated that DUDES accurately captures predictive uncertainties without sacrificing performance on the segmentation task. Additionally, qualitative results indicate impressive capabilities in identifying wrongly classified pixels and out-of-domain samples. With DUDES, we managed to simultaneously simplify and outperform previous work on DE-based UQ.
We hope that DUDES encourages other researchers to incorporate uncertainties into state-of-the-art semantic segmentation approaches and to explore the usefulness of our proposed method for other tasks such as detection or depth estimation.

Figure 1. 'Deep Uncertainty Distillation using Ensembles for Segmentation' (DUDES) applies student-teacher distillation with a Deep Ensemble (DE) to accurately approximate predictive uncertainties with a single forward pass while maintaining simplicity and adaptability.

Figure 2. A schematic overview of the training process of DUDES. DUDES is an easy-to-adapt framework for efficiently estimating predictive uncertainty through student-teacher distillation. The student model simultaneously outputs a segmentation prediction alongside a corresponding uncertainty prediction. Training the student involves a regular segmentation loss with the ground truth labels and an additional uncertainty loss. As ground truth uncertainties, we use the predictive uncertainty of a DE, which thereby acts as the teacher.

Figure 3. Example images from the Cityscapes validation set (a) with corresponding ground truth labels (b), our student's segmentation predictions (c), a binary accuracy map (d), and the student's uncertainty prediction (e). White pixels in the binary accuracy map are either incorrect predictions or void classes, which appear black in the ground truth label. For the uncertainty prediction, brighter pixels represent higher predictive uncertainties.

Figure 4. Comparison between the student's and the teacher's mean Intersection over Union (mIoU). We progressively ignore an increasing percentage of pixels in the segmentation prediction and simultaneously re-evaluate the mIoU. The pixels are sorted based on their predictive uncertainty in descending order, thus removing the most uncertain segmentation predictions first.

Table 3. Comparison of the inference time for a single image in milliseconds and the number of trainable parameters between the baseline, the teacher, and the student model. The inference time and corresponding standard deviation are based on 25 independent runs.

Figure 5. Ablation study on the impact of the number of ensemble members on the mean Intersection over Union (mIoU) and mean Uncertainty (mUnc).

Table 1. Quantitative comparison between the student's and the teacher's class-wise Intersection over Union (IoU). Higher IoU values denote better segmentation results, which are preferred. For the difference, the teacher's results are subtracted from the student's results.

Table 2. Quantitative comparison between the student's and the teacher's class-wise predictive uncertainties. In this case, a smaller difference is preferred as the student is trained to predict the same uncertainties as the teacher. The differences are calculated by subtracting the teacher's results from the student's results. They are highlighted based on the absolute differences being: ≤ 0.01, ≤ 0.02, ≤ 0.03, ≤ 0.04, ≤ 0.05, ≤ 0.06, ≥ 0.06.