A survey of uncertainty in deep neural networks

Over the last decade, neural networks have reached almost every field of science and become a crucial part of various real world applications. Due to the increasing spread, confidence in neural network predictions has become more and more important. However, basic neural networks do not deliver certainty estimates or suffer from over- or under-confidence, i.e. are badly calibrated. To overcome this, many researchers have been working on understanding and quantifying uncertainty in a neural network’s prediction. As a result, different types and sources of uncertainty have been identified and various approaches to measure and quantify uncertainty in neural networks have been proposed. This work gives a comprehensive overview of uncertainty estimation in neural networks, reviews recent advances in the field, highlights current challenges, and identifies potential research opportunities. It is intended to give anyone interested in uncertainty estimation in neural networks a broad overview and introduction, without presupposing prior knowledge in this field. For that, a comprehensive introduction to the most crucial sources of uncertainty is given and their separation into reducible model uncertainty and irreducible data uncertainty is presented. The modeling of these uncertainties based on deterministic neural networks, Bayesian neural networks (BNNs), ensemble of neural networks, and test-time data augmentation approaches is introduced and different branches of these fields as well as the latest developments are discussed. For a practical application, we discuss different measures of uncertainty, approaches for calibrating neural networks, and give an overview of existing baselines and available implementations. Different examples from the wide spectrum of challenges in the fields of medical image analysis, robotics, and earth observation give an idea of the needs and challenges regarding uncertainties in the practical applications of neural networks. 
Additionally, the practical limitations of uncertainty quantification methods in neural networks for mission- and safety-critical real world applications are discussed and an outlook on the next steps towards a broader usage of such methods is given.

Index Terms-Bayesian deep neural networks, Ensembles, Test-time augmentation, Calibration, Uncertainty

I. INTRODUCTION
Within the last decade, enormous advances in deep neural networks (DNNs) have been realized, encouraging their adoption in a variety of research fields where complex systems have to be modeled or understood, for example earth observation, medical image analysis, or robotics. Although DNNs have become attractive in high-risk fields such as medical image analysis [1], [2], [3], [4], [5], [6] or autonomous vehicle control [7], [8], [9], [10], their deployment in mission- and safety-critical real world applications remains limited. The main factors responsible for this limitation are
• the lack of expressiveness and transparency of a deep neural network's inference model, which makes it difficult to trust its outcomes [2],
• the inability to distinguish between in-domain and out-of-domain samples [11], [12] and the sensitivity to domain shifts [13],
• the inability to provide reliable uncertainty estimates for a deep neural network's decision [14] and frequently occurring overconfident predictions [15], [16],
• the sensitivity to adversarial attacks, which makes deep neural networks vulnerable to sabotage [17], [18], [19].
These factors are mainly based on an uncertainty already included in the data (data uncertainty) or a lack of knowledge of the neural network (model uncertainty). To overcome these limitations, it is essential to provide uncertainty estimates, such that uncertain predictions can be ignored or passed to human experts [20]. Providing uncertainty estimates is not only important for safe decision-making in high-risk fields, but also crucial in fields where the data sources are highly inhomogeneous and labeled data is rare, such as in remote sensing [21], [22]. Uncertainty estimates are also highly important in fields where uncertainties form a crucial part of the learning techniques, such as active learning [23], [24], [25], [26] or reinforcement learning [20], [27], [28], [29]. In recent years, researchers have shown an increased interest in estimating uncertainty in DNNs [30], [20], [31], [32], [33], [34], [35], [36]. The most common way to estimate the uncertainty on a prediction (the predictive uncertainty) is based on separately modelling the uncertainty caused by the model (epistemic or model uncertainty) and the uncertainty caused by the data (aleatoric or data uncertainty). While the former is reducible by improving the model that is learned by the DNN, the latter is not reducible. The most important approaches for modeling this separation are Bayesian inference [30], [20], [37], [9], [38], ensemble approaches [31], [39], [40], test-time data augmentation approaches [41], [42], and single deterministic networks containing explicit components to represent the model and the data uncertainty [43], [44], [45], [32], [46]. Estimating the predictive uncertainty alone is not sufficient for safe decision-making; it is also crucial to ensure that the uncertainty estimates are reliable. To this end, the calibration property (the degree of reliability) of DNNs has been investigated and re-calibration methods have been proposed [15], [47], [48] to obtain reliable (well-calibrated) uncertainty estimates.
There are several works that give an introduction and overview of uncertainty in statistical modelling. Ghanem et al. [49] published a handbook about uncertainty quantification, which includes a detailed and broad description of different concepts of uncertainty quantification, but without explicitly focusing on the application of neural networks. The theses of Gal [50] and Kendall [51] contain a good overview of Bayesian neural networks, with a particular focus on the Monte Carlo (MC) Dropout approach and its application in computer vision tasks. The thesis of Malinin [52] also contains a very good introduction and additional insights into Prior networks. Wang et al. contributed two surveys on Bayesian deep learning [53], [54]. They introduced a general framework and the conceptual description of Bayesian neural networks (BNNs), followed by an updated presentation of Bayesian approaches for uncertainty quantification in neural networks with a special focus on recommender systems, topic models, and control. In [55], an evaluation of uncertainty quantification in deep learning is given by presenting and comparing the uncertainty quantification based on the softmax output, ensembles of networks, Bayesian neural networks, and autoencoders on the MNIST data set. Regarding the practicability of uncertainty quantification approaches for real-life mission- and safety-critical applications, Gustafsson et al. [56] introduced a framework to test the robustness required in real-world computer vision applications and delivered a comparison of two popular approaches, namely MC Dropout and ensemble methods. Hüllermeier et al. [57] presented the concepts of aleatoric and epistemic uncertainty in neural networks and discussed different concepts to model and quantify them. In contrast to this, Abdar et al. [58] presented an overview of uncertainty quantification methodologies in neural networks and provided an extensive list of references for different application fields and a discussion of open challenges.
In this work, we present an extensive overview of all concepts that have to be taken into account when working with uncertainty in neural networks, while keeping the applicability to real world applications in mind. Our goal is to provide the reader with a clear thread from the sources of uncertainty to applications where uncertainty estimates are needed. Furthermore, we point out the limitations of current approaches and discuss further challenges to be tackled in the future. For that, we provide a broad introduction and comparison of different approaches and fundamental concepts. The survey is mainly designed for people who are already familiar with deep learning concepts and who are planning to incorporate uncertainty estimation into their predictions. For people already familiar with the topic, this review also provides a useful overview of the whole concept of uncertainty in neural networks and its applications in different fields. In summary, we comprehensively discuss
• Sources and types of uncertainty (Section II),
• Recent studies and approaches for estimating uncertainty in DNNs (Section III),
• Uncertainty measures and methods for assessing the quality and impact of uncertainty estimates (Section IV),
• Recent studies and approaches for calibrating DNNs (Section V),
• An overview of frequently used evaluation data sets, available benchmarks, and implementations (Section VI),
• An overview of real-world applications using uncertainty estimates (Section VII),
• A discussion of current challenges and further directions of research in the future (Section VIII).
In general, the principles and methods for estimating uncertainty and calibrating DNNs can be applied to all regression, classification, and segmentation problems, if not stated differently. For a deeper dive into specific applications of the methods, we refer to the section on applications and to further readings in the referenced literature.

II. UNCERTAINTY IN DEEP NEURAL NETWORKS
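A neural network is a non-linear function $f_\theta$ parameterized by model parameters $\theta$ (i.e. the network weights) that maps from a measurable input set $\mathbb{X}$ to a measurable output set $\mathbb{Y}$, i.e.
$$f_\theta: \mathbb{X} \rightarrow \mathbb{Y}, \qquad f_\theta(x) = y. \tag{1}$$
For a supervised setting, we further have a finite set of training data $\mathcal{D} \subseteq \mathbb{D} = \mathbb{X} \times \mathbb{Y}$ containing $N$ data samples and corresponding targets, i.e.
$$\mathcal{D} = \{x_i, y_i\}_{i=1}^{N} \subseteq \mathbb{D}. \tag{2}$$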
For a new data sample $x^* \in \mathbb{X}$, a neural network trained on $\mathcal{D}$ can be used to predict a corresponding target $f_\theta(x^*) = y^*$. We consider four different steps from the raw information in the environment to a prediction by a neural network with quantified uncertainties, namely
1) the data acquisition process: The occurrence of some information in the environment (e.g. a bird's singing) and a measured observation of this information (e.g. an audio recording).
2) the DNN building process: The design and training of a neural network.
3) the applied inference model: The model which is applied for inference (e.g. a Bayesian neural network or an ensemble of neural networks).
4) the prediction's uncertainty model: The modelling of the uncertainties caused by the neural network and by the data.
In practice, these four steps contain several potential sources of uncertainty and errors, which in turn affect the final prediction of a neural network. The five factors that we consider the most vital causes of uncertainty in a DNN's predictions are
• the variability in real world situations,
• the errors inherent to the measurement systems,
• the errors in the architecture specification of the DNN,
• the errors in the training procedure of the DNN,
• the errors caused by unknown data.
In the following, the four steps leading from raw information to uncertainty quantification on a DNN's prediction are described in more detail. Within this, we highlight the sources of uncertainty that are related to the individual steps and explain how the uncertainties are propagated through the process. Finally, we introduce a model for the uncertainty on a neural network's prediction and introduce the main types of uncertainty considered in neural networks. The goal of this section is to give an accessible idea of the uncertainties in neural networks. Hence, for the sake of simplicity, we only describe and discuss the mathematical properties that are relevant for understanding the approaches and applying the methodology in different fields.

A. Data Acquisition
In the context of supervised learning, data acquisition describes the process in which measurements $x$ and target variables $y$ are generated in order to represent a (real world) situation $\omega$ from some space $\Omega$. In the real world, a realization of $\omega$ could for example be a bird, $x$ a picture of this bird, and $y$ a label stating 'bird'. During the measurement, random noise can occur and information may get lost. We model this randomness in $x$ by
$$x|\omega \sim p_{x|\omega}. \tag{3}$$
Analogously, the corresponding target variable $y$ is derived, where the description is either based on another measurement or is the result of a labeling process. In both cases, the description can be affected by noise and errors, and we state it as
$$y|\omega \sim p_{y|\omega}. \tag{4}$$
A neural network is trained on a finite data set of realizations of $x|\omega_i$ and $y|\omega_i$ based on $N$ real world situations $\omega_1, ..., \omega_N$,
$$\mathcal{D} = \{x_1, y_1, ..., x_N, y_N\}. \tag{5}$$
When collecting the training data, two factors can cause uncertainty in a neural network trained on this data. First, the sample space $\Omega$ should be sufficiently covered by the training data $x_1, ..., x_N$ for $\omega_1, ..., \omega_N$. For that, one has to take into account that for a new sample $x^*$ it in general holds that $x^* \neq x_i$ for all training samples $x_i$. Consequently, the target has to be estimated based on the trained neural network model, which directly leads to the first factor of uncertainty:
Factor I: Variability in Real World Situations
Most real world environments are highly variable and almost constantly affected by changes. These changes affect parameters such as temperature, illumination, clutter, and the size and shape of physical objects. Changes in the environment can also affect the appearance of objects; for example, plants after rain look very different from plants after a drought. When real world situations change compared to the training set, this is called a distribution shift. Neural networks are sensitive to distribution shifts, which can lead to significant changes in the performance of a neural network.
The second case is based on the measurement system, which has a direct effect on the correlation between the samples and the corresponding targets. The measurement system generates information $x_i$ and $y_i$ that describe $\omega_i$ but might not contain enough information to learn a direct mapping from $x_i$ to $y_i$. This means that there might be highly different real world information $\omega_i$ and $\omega_j$ (e.g. city and forest) resulting in very similar corresponding measurements $x_i$ and $x_j$ (e.g. temperature) or similar corresponding targets $y_i$ and $y_j$ (e.g. label noise that labels both samples as forest). This directly leads to our second factor of uncertainty:
Factor II: Error and Noise in Measurement Systems
The measurements themselves can be a source of uncertainty on the neural network's prediction. This can be caused by limited information in the measurements, for example the image resolution, or by measuring unsuitable or insufficient information modalities. Moreover, it can be caused by noise, for example sensor noise, or by motion or mechanical stress leading to imprecise measurements. Furthermore, false labeling is also a source of uncertainty that can be seen as error and noise in the measurement system. It is referred to as label noise and affects the model by reducing the confidence in the true class prediction during training.
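The data acquisition model in (3) and (4) can be illustrated with a minimal simulation sketch. Everything specific in it is an assumption made for illustration and not part of the survey: the two situations (city and forest, as in the example above), the mean temperatures, the sensor noise level, the label noise rate, and the function names `acquire` and `threshold_error`.

```python
import random

def acquire(n, label_noise=0.1, sensor_std=2.0, seed=0):
    """Draw n pairs (x_i, y_i) from hypothetical p(x|omega) and p(y|omega)."""
    rng = random.Random(seed)
    # Hypothetical mean temperature per real world situation (made-up values).
    mean_temp = {"city": 16.0, "forest": 14.0}
    data = []
    for _ in range(n):
        omega = rng.choice(["city", "forest"])        # real world situation
        x = rng.gauss(mean_temp[omega], sensor_std)   # x | omega: noisy measurement
        y = omega                                     # y | omega: label ...
        if rng.random() < label_noise:                # ... flipped by label noise
            y = "forest" if omega == "city" else "city"
        data.append((x, y))
    return data

def threshold_error(data, threshold=15.0):
    """Fraction of samples whose label disagrees with a simple threshold rule."""
    wrong = 0
    for x, y in data:
        predicted = "city" if x > threshold else "forest"
        wrong += predicted != y
    return wrong / len(data)

if __name__ == "__main__":
    samples = acquire(2000)
    print("error of the threshold rule:", threshold_error(samples))
```

Under these made-up settings, the measurements of the two situations overlap strongly, so even a threshold placed midway between the two means misclassifies roughly a third of the samples. Collecting more samples from the same measurement system does not remove this error, which is the sense in which Factor II is tied to the data rather than to the model.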

B. Deep Neural Network Design and Training
The design of a DNN covers the explicit modeling of the neural network and its stochastic training process. The assumptions on the problem structure induced by the design and training of the neural network are called inductive bias [59]. We summarize all decisions of the modeler on the network's structure (e.g. the number of parameters, the layers, the activation functions, etc.) and training process (e.g. optimization algorithm, regularization, augmentation, etc.) in a structure configuration $s$. The defined network structure gives the third factor of uncertainty in a neural network's predictions:
Factor III: Errors in the Model Structure
The structure of a neural network has a direct effect on its performance and therefore also on the uncertainty of its prediction. For instance, the number of parameters affects the memorization capacity, which can lead to under- or over-fitting on the training data. Regarding uncertainty in neural networks, it is known that deeper networks tend to be overconfident in their softmax output, meaning that they assign too much probability to the class with the highest probability score [15].
For a given network structure $s$ and a training data set $\mathcal{D}$, the training of a neural network is a stochastic process and therefore the resulting neural network $f_\theta$ is based on a random variable,
$$\theta | \mathcal{D}, s \sim p_{\theta|\mathcal{D},s}. \tag{6}$$
The process is stochastic due to random decisions such as the order of the data, random initialization, or random regularization such as augmentation or dropout. The loss landscape of a neural network is highly non-linear and the randomness in the training process in general leads to different local optima $\theta^*$, resulting in different models [31]. Parameters such as batch size, learning rate, and the number of training epochs also affect the training and result in different models. Depending on the underlying task, these models can differ significantly in their predictions for single samples, even leading to a difference in the overall model performance. This sensitivity to the training process directly leads to the fourth factor for uncertainties in neural network predictions:
Factor IV: Errors in the Training Procedure
The training process of a neural network includes many parameters that have to be defined (batch size, optimizer, learning rate, stopping criteria, regularization, etc.), and stochastic decisions also take place within the training process (batch generation and weight initialization). All these decisions affect the local optima, and it is therefore very unlikely that two training processes deliver the same model parameterization. A training data set that suffers from imbalance or low coverage of single regions in the data distribution also introduces uncertainties on the network's learned parameters, as already described for the data acquisition. This might be mitigated by applying augmentation to increase the variety or by balancing the impact of single classes or regions on the loss function.
Since the training process is based on the given training data set $\mathcal{D}$, errors in the data acquisition process (e.g. label noise) can result in errors in the training process.
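The dependence of the learned parameters on random initialization and data order can be made concrete with a small sketch. The toy model $y = w_2 \tanh(w_1 x)$, the synthetic data, the hyperparameters, and the function name `train` are assumptions chosen for illustration only; they are not taken from the survey or the cited works.

```python
import math
import random

def train(seed, epochs=50, lr=0.05):
    """Fit y = w2 * tanh(w1 * x) with SGD; all training randomness comes from `seed`."""
    # Hypothetical training set D: noisy samples of tanh(2x), identical for every run.
    data_rng = random.Random(0)
    data = [(i / 10.0, math.tanh(2 * i / 10.0) + data_rng.gauss(0, 0.05))
            for i in range(-10, 11)]

    rng = random.Random(seed)
    w1, w2 = rng.uniform(-1, 1), rng.uniform(-1, 1)   # random initialization
    for _ in range(epochs):
        rng.shuffle(data)                             # random order of the data
        for x, y in data:
            h = math.tanh(w1 * x)
            err = w2 * h - y
            # Gradients of the squared error 0.5 * err**2.
            grad_w1 = err * w2 * (1.0 - h * h) * x
            grad_w2 = err * h
            w1 -= lr * grad_w1
            w2 -= lr * grad_w2
    return w1, w2

if __name__ == "__main__":
    for seed in range(3):
        print(seed, train(seed))
```

Runs with the same seed reproduce the same parameters, while different seeds end in different parameter values even though the structure configuration and the training data are unchanged. This is a toy instance of the random variable in (6), not a statement about how large the effect is for realistic networks.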

C. Inference
The inference describes the prediction of an output $y^*$ for a new data sample $x^*$ by the neural network. At this time, the network is trained for a specific task. Thus, samples that are not valid inputs for this task cause errors and are therefore also a source of uncertainty:
Factor V: Errors Caused by Unknown Data
Especially in classification tasks, a neural network that is trained on samples derived from a world $W_1$ can also be capable of processing samples derived from a completely different world $W_2$. This is for example the case when a network trained on images of cats and dogs receives a sample showing a bird. Here, the source of uncertainty does not lie in the data acquisition process, since we assume a world to contain only feasible inputs for a prediction task. Even though the practical result might be equal to too much noise on a sensor or a complete failure of a sensor, the data considered here represents a valid sample, but for a different task or domain.

D. Predictive Uncertainty Model
As a modeller, one is mainly interested in the uncertainty that is propagated onto a prediction $y^*$, the so-called predictive uncertainty. Within the data acquisition model, the probability distribution for a prediction $y^*$ based on some sample $x^*$ is given by
$$p(y^*|x^*) = \int_\Omega p(y^*|\omega)\, p(\omega|x^*)\, d\omega \tag{7}$$
and a maximum a posteriori (MAP) estimation is given by
$$y^* = \arg\max_{y}\, p(y|x^*). \tag{8}$$
Since the modeling is based on the unavailable latent variable $\omega$, one takes an approximative representation based on a sampled training data set $\mathcal{D} = \{x_i, y_i\}_{i=1}^{N}$ containing $N$ samples and corresponding targets. The distribution and MAP estimator in (7) and (8) for a new sample $x^*$ are then predicted based on the known examples by
$$p(y^*|x^*) = p(y^*|\mathcal{D}, x^*) \tag{9}$$
and
$$y^* = \arg\max_{y}\, p(y|\mathcal{D}, x^*). \tag{10}$$
In general, the distribution given in (9) is unknown and can only be estimated based on the given data in $\mathcal{D}$. For this estimation, neural networks form a very powerful tool for many tasks and applications. The prediction of a neural network is subject to both model-dependent and input data-dependent errors, and therefore the predictive uncertainty associated with $y^*$ is in general separated into data uncertainty (also statistical or aleatoric uncertainty [57]) and model uncertainty (also systemic or epistemic uncertainty [57]). Depending on the underlying approach, an additional explicit modeling of distributional uncertainty [32] is used to model the uncertainty that is caused by examples from a region not covered by the training data.
1) Model and Data Uncertainty: The model uncertainty covers the uncertainty that is caused by shortcomings in the model, either by errors in the training procedure, an insufficient model structure, or lack of knowledge due to unknown samples or a bad coverage of the training data set.
In contrast to this, data uncertainty is related to uncertainty that directly stems from the data. Data uncertainty is caused by information loss when representing the real world within a data sample and represents the distribution stated in (7). For example, in regression tasks, noise in the input and target measurements causes data uncertainty that the network cannot learn to correct. In classification tasks, samples that do not contain enough information to identify one class with 100% certainty cause data uncertainty on the prediction. The information loss is a result of the measurement system, e.g. of representing real world information by image pixels with a specific resolution, or of errors in the labelling process. Considering the five presented factors for uncertainties on a neural network's prediction, model uncertainty covers Factors I, III, IV, and V, and data uncertainty is related to Factor II. While model uncertainty can (theoretically) be reduced by improving the architecture, the learning process, or the training data set, the data uncertainties cannot be explained away [60]. Therefore, DNNs that are capable of handling uncertain inputs and that are able to remove or quantify the model uncertainty and give a correct prediction of the data uncertainty are of paramount importance for a variety of real world mission- and safety-critical applications.
The Bayesian framework offers a practical tool to reason about uncertainty in deep learning [61]. In Bayesian modeling, the model uncertainty is formalized as a probability distribution over the model parameters $\theta$, while the data uncertainty is formalized as a probability distribution over the model outputs $y^*$, given a parameterized model $f_\theta$. The distribution over a prediction $y^*$, the predictive distribution, is then given by
$$p(y^*|x^*, \mathcal{D}) = \int \underbrace{p(y^*|x^*, \theta)}_{\text{Data}}\, \underbrace{p(\theta|\mathcal{D})}_{\text{Model}}\, d\theta. \tag{11}$$
The term $p(\theta|\mathcal{D})$ is referred to as the posterior distribution on the model parameters and describes the uncertainty on the model parameters given a training data set $\mathcal{D}$. The posterior distribution is in general not tractable. While ensemble approaches seek to approximate it by learning several different parameter settings and averaging over the resulting models [31], Bayesian inference reformulates it using Bayes' theorem [62]
$$p(\theta|\mathcal{D}) = \frac{p(\mathcal{D}|\theta)\, p(\theta)}{p(\mathcal{D})}. \tag{12}$$
The term $p(\theta)$ is called the prior distribution on the model parameters, since it does not take any information but the general knowledge on $\theta$ into account. The term $p(\mathcal{D}|\theta)$ represents the likelihood that the data in $\mathcal{D}$ is a realization of the distribution predicted by a model parameterized with $\theta$.
Many loss functions are motivated by or can be related to the likelihood function. Examples of loss functions whose minimization corresponds to maximizing the log-likelihood (for an assumed distribution) are the cross-entropy and the mean squared error [63].
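As a worked example of this relation, consider the mean squared error. The derivation below is a sketch under assumptions that are not stated in the text: the targets are modeled as $y_i = f_\theta(x_i) + \epsilon_i$ with independent Gaussian noise $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ of fixed variance, and $p(\mathcal{D}|\theta)$ is read as the likelihood of the targets given the inputs.

```latex
% Assumption: y_i = f_\theta(x_i) + \epsilon_i, with i.i.d. \epsilon_i \sim \mathcal{N}(0, \sigma^2) and fixed \sigma.
\begin{align*}
-\log p(\mathcal{D}|\theta)
  &= -\sum_{i=1}^{N} \log \mathcal{N}\!\left(y_i \mid f_\theta(x_i), \sigma^2\right) \\
  &= \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left(y_i - f_\theta(x_i)\right)^2
     + \frac{N}{2} \log\!\left(2\pi\sigma^2\right), \\
\arg\max_\theta \, \log p(\mathcal{D}|\theta)
  &= \arg\min_\theta \, \frac{1}{N} \sum_{i=1}^{N} \left(y_i - f_\theta(x_i)\right)^2 .
\end{align*}
```

The second term does not depend on $\theta$ and the factor $1/(2\sigma^2)$ is a positive constant, so under these assumptions minimizing the mean squared error and maximizing the log-likelihood select the same parameters. An analogous argument with a categorical likelihood leads to the cross-entropy.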
Even with the reformulation given in (12), the predictive distribution given in (11) is still intractable. To overcome this, several different ways to approximate the predictive distribution have been proposed. A broad overview of the different concepts and some specific approaches is presented in Section III.
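One generic way to approximate the integral in (11) is a Monte Carlo average over a finite number of parameter settings, $p(y^*|x^*, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^{M} p(y^*|x^*, \theta_m)$. The sketch below assumes that the $M$ parameter settings have already been obtained somehow (for example from an ensemble or from samples of an approximate posterior) and that only their logits for a single input are available. The entropy-based split into a data part and a model part is one common choice and not the only possible one, and all function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def mc_predictive(logit_samples):
    """Average p(y*|x*, theta_m) over M parameter settings theta_m.

    Returns the averaged distribution and an entropy-based split:
    total = entropy of the average, data = average entropy of the
    single predictions, model = total - data (their disagreement).
    """
    probs = [softmax(logits) for logits in logit_samples]
    m = len(probs)
    mean = [sum(p[k] for p in probs) / m for k in range(len(probs[0]))]
    total = entropy(mean)
    data = sum(entropy(p) for p in probs) / m
    return mean, total, data, total - data

if __name__ == "__main__":
    agreeing = [[2.0, 0.0, 0.0]] * 3                       # all settings agree
    disagreeing = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
    print(mc_predictive(agreeing))
    print(mc_predictive(disagreeing))
```

When all parameter settings produce the same prediction, the model part is zero and the remaining uncertainty is attributed to the data. When the settings are individually confident but disagree, the averaged prediction is close to uniform and most of the total entropy is attributed to the model.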
2) Distributional Uncertainty: Depending on the approaches that are used to quantify the uncertainty in $y^*$, the formulation of the predictive distribution might be further separated into data, distributional, and model parts [32]:
$$p(y^*|x^*, \mathcal{D}) = \int\!\!\int \underbrace{p(y^*|\mu)}_{\text{Data}}\, \underbrace{p(\mu|x^*, \theta)}_{\text{Distributional}}\, \underbrace{p(\theta|\mathcal{D})}_{\text{Model}}\, d\mu\, d\theta, \tag{13}$$
where $\mu$ denotes the actual network output. The distributional part in (13) represents the uncertainty on the actual network output, e.g. for classification tasks this might be a Dirichlet distribution, which is a distribution over the categorical distribution given by the softmax output. Modeled this way, distributional uncertainty refers to uncertainty that is caused by a change in the input-data distribution, while model uncertainty refers to uncertainty that is caused by the process of building and training the DNN. As modeled in (13), the model uncertainty affects the estimation of the distributional uncertainty, which affects the estimation of the data uncertainty. While most methods presented in this paper only distinguish between model and data uncertainty, approaches specialized in out-of-distribution detection often explicitly aim at representing the distributional uncertainty [32], [64]. A more detailed presentation of different approaches for quantifying uncertainties in neural networks is given in Section III. In Section IV, different measures for the different types of uncertainty are presented.
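The idea of a Dirichlet distribution as a distribution over categorical distributions can be sketched with a few lines of sampling code. The concentration parameters below are made-up values and the function names are hypothetical. Reading a flat Dirichlet as high distributional uncertainty and a sharply peaked one as low distributional uncertainty is the interpretation we assume here for illustration; it is not a reproduction of any specific method from [32].

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one categorical distribution mu ~ Dir(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def spread_of_first_class(alpha, n=2000, seed=0):
    """Empirical variance of the first class probability under Dir(alpha)."""
    rng = random.Random(seed)
    first = [sample_dirichlet(alpha, rng)[0] for _ in range(n)]
    mean = sum(first) / n
    return sum((v - mean) ** 2 for v in first) / n

if __name__ == "__main__":
    flat = [1.0, 1.0, 1.0]      # hypothetical output for an unfamiliar input
    peaked = [50.0, 1.0, 1.0]   # hypothetical output for a familiar input
    print("flat:  ", spread_of_first_class(flat))
    print("peaked:", spread_of_first_class(peaked))
```

Each sample is itself a valid categorical distribution over the three classes. With the flat parameters the sampled class probabilities vary widely from draw to draw, whereas with the peaked parameters almost every draw puts most of its mass on the first class.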

E. Uncertainty Classification
On the basis of the input data domain, the predictive uncertainty can also be classified into three main classes:
• In-domain uncertainty [65]: In-domain uncertainty represents the uncertainty related to an input drawn from a data distribution assumed to be equal to the training data distribution. The in-domain uncertainty stems from the inability of the deep neural network to explain an in-domain sample due to a lack of in-domain knowledge. From a modeler's point of view, in-domain uncertainty is caused by design errors (model uncertainty) and the complexity of the problem at hand (data uncertainty). Depending on the source of the in-domain uncertainty, it might be reduced by increasing the quality of the training data (set) or the training process [57].
• Domain-shift uncertainty [13]: Domain-shift uncertainty denotes the uncertainty related to an input drawn from a shifted version of the training distribution. The distribution shift results from insufficient coverage by the training data and the variability inherent to real world situations. A domain shift might increase the uncertainty due to the inability of the DNN to explain the domain-shift sample on the basis of the samples seen at training time. Some errors causing domain-shift uncertainty can be modeled and can therefore be reduced. For example, occluded samples can be learned by the deep neural network to reduce domain-shift uncertainty caused by occlusions [66]. However, it is difficult if not impossible to model all errors causing domain-shift uncertainty, e.g., motion noise [60]. From a modeler's point of view, domain-shift uncertainty is caused by external or environmental factors but can be reduced by covering the shifted domain in the training data set.
• Out-of-domain uncertainty [67], [68], [69], [70]: Out-of-domain uncertainty represents the uncertainty related to an input drawn from the subspace of unknown data. The distribution of unknown data is different and far from the training distribution. While a DNN can extract in-domain knowledge from domain-shift samples, it cannot extract in-domain knowledge from out-of-domain samples. For example, while domain-shift uncertainty describes phenomena like a blurred picture of a dog, out-of-domain uncertainty describes the case when a network that learned to classify cats and dogs is asked to predict a bird. The out-of-domain uncertainty stems from the inability of the DNN to explain an out-of-domain sample due to its lack of out-of-domain knowledge. From a modeler's point of view, out-of-domain uncertainty is caused by input samples for which the network is not meant to give a prediction, or by insufficient training data.
Fig. 1: Visualization of the data, the model, and the distributional uncertainty for classification and regression models.
Since the model uncertainty captures what the DNN does not know due to a lack of in-domain or out-of-domain knowledge, it captures all of the in-domain, domain-shift, and out-of-domain uncertainties. In contrast, the data uncertainty captures in-domain uncertainty that is caused by the nature of the data the network is trained on, for example overlapping samples and systematic label noise.

III. UNCERTAINTY ESTIMATION
As described in Section II, several factors may cause model and data uncertainty and affect a DNN's prediction. This variety of sources of uncertainty makes the complete exclusion of uncertainties in a neural network impossible for almost all applications. Especially in practical applications employing real world data, the training data is only a subset of all possible input data, which means that a mismatch between the DNN domain and the unknown actual data domain is often unavoidable. However, an exact representation of the uncertainty of a DNN prediction is also not possible to compute, since the different uncertainties can in general not be modeled accurately and are most often even unknown. Therefore, the estimation of uncertainty in a DNN prediction is a popular and vital field of research. The data uncertainty part is normally represented in the prediction, e.g. in the softmax output of a classification network or in the explicit prediction of a standard deviation in a regression network [60]. In contrast to this, several different approaches have been introduced that model the model uncertainty and seek to separate it from the data uncertainty in order to obtain an accurate representation of the data uncertainty [60], [32], [31].
Figure caption (fragment): Factor II is illustrated by insufficient measurements, which cannot directly be used to separate between settlement and forest, and by label noise. In practice, the resolution of such images can also be low, which would likewise be part of Factor II. Factor III and Factor IV represent the uncertainties caused by the network structure and the stochastic training process, respectively. Factor V, in contrast, is represented by feeding the trained network with unknown types of images, namely cows and pigs.
In general, the methods for estimating the uncertainty can be split into four types based on the number (single or multiple) and the nature (deterministic or stochastic) of the DNNs used.
• Single deterministic methods give the prediction based on one single forward pass within a deterministic network. The uncertainty quantification is either derived by using additional (external) methods or is directly predicted by the network.
• Bayesian methods cover all kinds of stochastic DNNs, i.e. DNNs where two forward passes of the same sample generally lead to different results.
• Ensemble methods combine the predictions of several different deterministic networks at inference.
• Test-time augmentation methods give the prediction based on one single deterministic network but augment the input data at test-time in order to generate several predictions that are used to evaluate the certainty of the prediction.
In the following, the main ideas and further extensions of the four types are presented and their main properties are discussed. In Figure 3, an overview of the different types and methods is given. In Figure 4, the different underlying principles that are used to differentiate between the different types of methods are presented.

[Figure 3 residue: panel entries "Original works" [78], [79], [80], "Stochastic MCMC" [81], and "Theoretic Advances"; the remaining figure footnotes cite [20], [82], [83], [63], [84], [31], [85], [86], [87], [88], [89], [90], [45], [39], [40], [91].]

A. Single Deterministic Methods
For deterministic neural networks the parameters are deterministic and each repetition of a forward pass delivers the same result. With single deterministic network methods for uncertainty quantification, we summarize all approaches where the uncertainty on a prediction y* is computed based on one single forward pass within a deterministic network. In the literature, several such approaches can be found. They can be roughly categorized into approaches where one single network is explicitly modeled and trained in order to quantify uncertainties [44], [32], [92], [64], [93] and approaches that use additional components in order to give an uncertainty estimate on the prediction of a network [46], [36], [71], [72]. While for the first type, the uncertainty quantification affects the training procedure and the predictions of the network, the latter type is in general applied on already trained networks. Since trained networks are not modified by such methods, they have no effect on the network's predictions. In the following, we call these two types internal and external uncertainty quantification approaches.
[Figure caption, continued] In practice, other methods could be utilized. For the deterministic approaches, the idea of predicting the parameters of a probability distribution Ξ is visualized; other approaches, which are based on tools in addition to the prediction network, are not visualized here.

TABLE II: Overview of the properties of internal and external deterministic single network methods. For a comparison of single deterministic network approaches with Bayesian, ensemble, and test-time augmentation methods, see Table I.

Description
• Internal methods: Estimate uncertainty using one evaluation of a single network without external components.
• External methods: Estimate uncertainty using one evaluation of the network while relying on additional external components.

Implementation effort
• Internal methods: Relatively low, but depends on the explicit approach; often only the loss and the network output have to be adjusted.
• External methods: Relatively low, but depends on the explicit approach.

Application on already trained networks possible
• Internal methods: No
• External methods: Yes

Separated prediction and uncertainty estimation
• Internal methods: No
• External methods: Yes

1) Internal Uncertainty Quantification Approaches: Many of the internal uncertainty quantification approaches follow the idea of predicting the parameters of a distribution over the predictions instead of a direct pointwise maximum-a-posteriori estimation. Often, the loss function of such networks takes the expected divergence between the true distribution and the predicted distribution into account, as e.g. in [32], [94]. The distribution over the outputs can be interpreted as a quantification of the model uncertainty (see Section II), trying to emulate the behavior of a Bayesian modeling of the network parameters [64]. The prediction is then given as the expected value of the predicted distribution.
For classification tasks, the output in general represents class probabilities. These probabilities result from applying the softmax function (multiclass settings) or the sigmoid function (binary classification tasks) to the logits z. They can already be interpreted as a prediction of the data uncertainty. However, it is widely discussed that neural networks are often over-confident and that the softmax output is often poorly calibrated, leading to inaccurate uncertainty estimates [95], [67], [44], [92]. Furthermore, the softmax output cannot be associated with model uncertainty. But without explicitly taking the model uncertainty into account, out-of-distribution samples could lead to outputs that certify a false confidence. For example, a network trained on cats and dogs will very likely not output 50% dog and 50% cat when it is fed with the image of a bird. This is because the network extracts features from the image, and even though the features do not fit the cat class, they might fit the dog class even less. As a result, the network puts more probability on cat. Furthermore, it was shown that the combination of rectified linear unit (ReLU) networks and the softmax output leads to settings where the network becomes more and more confident as the distance between an out-of-distribution sample and the learned training set becomes larger [96]. Figure 5 shows an example where the rotation of a digit from MNIST leads to false predictions with high softmax values. This phenomenon is described and further investigated by Hein et al. [96], who proposed a method to avoid this behaviour, based on enforcing a uniform predictive distribution far away from the training data.
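The growth of softmax confidence far away from the training data can be illustrated with a minimal sketch. It assumes a hypothetical bias-free two-layer ReLU classifier with random weights (`W1`, `W2` are illustrative and not taken from [96]); without biases, scaling the input scales the logits linearly, so the maximum softmax probability cannot decrease as the input moves away from the origin.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)              # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
# Hypothetical bias-free ReLU classifier (random weights, 3 classes).
W1 = 0.3 * rng.normal(size=(16, 8))
W2 = 0.3 * rng.normal(size=(3, 16))

def logits(x):
    return W2 @ np.maximum(W1 @ x, 0.0)

x = rng.normal(size=8)
scales = [1.0, 10.0, 100.0]
# Larger scale = input further away from the origin (and from any finite data set).
confidences = [softmax(logits(s * x)).max() for s in scales]
for s, c in zip(scales, confidences):
    print(f"scale {s:6.1f}: max softmax probability = {c:.4f}")
```

The sketch only demonstrates the effect along one ray of a toy network; it is not the analysis or the remedy of Hein et al. [96].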
Several other classification approaches [44], [32], [94], [64] followed a similar idea of taking the logit magnitude into account, but make use of the Dirichlet distribution. The Dirichlet distribution is the conjugate prior of the categorical distribution and hence can be interpreted as a distribution over categorical distributions. The density of the Dirichlet distribution is defined by
$$\mathrm{Dir}(\mu \mid \alpha) = \frac{\Gamma(\alpha_0)}{\prod_{c=1}^{K} \Gamma(\alpha_c)} \prod_{c=1}^{K} \mu_c^{\alpha_c - 1}, \qquad \alpha_c > 0, \quad \alpha_0 = \sum_{c=1}^{K} \alpha_c,$$
where $\mu = (\mu_1, \dots, \mu_K)$ denotes the class probabilities of a categorical distribution, $\Gamma$ is the gamma function, $\alpha_1, \dots, \alpha_K$ are called the concentration parameters, and the scalar $\alpha_0$ is the precision of the distribution. In practice, the concentrations $\alpha_1, \dots, \alpha_K$ are derived by applying a strictly positive transformation, as for example the exponential function, to the logit values. As visualized in Figure 6, a higher concentration value leads to a sharper Dirichlet distribution. The set of all class probabilities of a categorical distribution over $K$ classes is equivalent to a $(K-1)$-dimensional standard or probability simplex. Each node of this simplex represents a probability vector with the full probability mass on one class and each convex combination of the nodes represents a categorical distribution with the probability mass distributed over multiple classes. Malinin et al.
[32] argued that a high model uncertainty should lead to a lower precision value and therefore to a flat distribution over the whole simplex, since the network is not familiar with the data. In contrast to this, data uncertainty should be represented by a sharper but also centered distribution, since the network can handle the data, but cannot give a clear class preference. In Figure 6 the different desired behaviors are shown. The Dirichlet distribution is utilized in several approaches such as Dirichlet Prior Networks [43], [32] and Evidential Neural Networks [97], [44]. Both of these network types output the parameters of a Dirichlet distribution from which the categorical distribution describing the class probabilities can be derived. The general idea of prior networks [32] is already described above and is visualized in Figure 6. Prior networks are trained in a multi-task way with the goal of minimizing the expected Kullback-Leibler (KL) divergence between the predictions of in-distribution data and a sharp Dirichlet distribution and between a flat Dirichlet distribution and the predictions of out-of-distribution data [32]. Besides the main motivation of a better separation between in-distribution and OOD samples, these approaches also improve the separation between the confidence of correct and incorrect predictions, as was shown by [98]. As a follow up, [94] discussed that for the case that the data uncertainty is high, the forward definition of the KL-divergence can lead to an undesirable multi-modal target distribution. In order to avoid this, they reformulated the loss using the reverse KL-divergence. The experiments showed improved results in the uncertainty estimation as well as for the adversarial robustness. Zhao et al. [99] extended the Dirichlet network approach by a new loss function that aims at minimizing an upper bound on the expected error based on the $L_\infty$-norm, i.e.
optimizing an expected worst-case upper bound. [34] argued that using a mixture of Dirichlet distributions gives much more flexibility in approximating the posterior distribution. Therefore, an approach where the network predicts the parameters for a mixture of M Dirichlet distributions was suggested. For this, the network logits represent the parameters for M Dirichlet distributions and additionally M weights $\omega_i$, $i = 1, \dots, M$, with the constraint $\sum_{i=1}^{M} \omega_i = 1$ are optimized. Nandy et al. [64] analytically showed that for in-domain samples with high data uncertainty, the Dirichlet distribution predicted for a false prediction is often flatter than for a correct prediction. They argued that this makes it harder to differentiate between in- and out-of-distribution predictions and suggested a regularization term towards maximizing the gap between in- and out-of-distribution samples. Evidential neural networks [44] also optimize the parameterization of a single Dirichlet network. The loss formulation is derived by using subjective logic and interpreting the logits as multinomial opinions or beliefs, as introduced in Evidence or Dempster-Shafer theory [100]. Evidential neural networks set the total amount of evidence in relation to the number of classes and derive a value of uncertainty from this, i.e. they receive an additional "I don't know" class. The loss is formulated as the expected value of a basic loss, as for example categorical cross entropy, with respect to a Dirichlet distribution parameterized by the logits. Additionally, a regularization term is added, encouraging the network to not consider features that provide evidence for multiple classes at the same time, as for example a circle does for 6 and 8.
Due to this, the networks do not differentiate between data uncertainty and model uncertainty, but learn whether they can give a certain prediction or not. [33] extended this idea by differentiating between vacuity and dissonance in the collected evidence in order to better separate in- and out-of-distribution samples. For that, two explicit data sets containing overlapping classes and out-of-distribution samples are needed to learn a regularization term. Amini et al. [101] transferred the idea of evidential neural networks from classification tasks to regression tasks by learning the parameters of an evidential normal inverse gamma distribution over an underlying Normal distribution. Charpentier et al. [102] avoided the need for OOD data in the training process by using normalizing flows to learn a distribution over a latent space for each class. A new input sample is projected onto this latent space and a Dirichlet distribution is parameterized based on the class-wise densities of the received latent point.
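The role of the Dirichlet precision described above can be made concrete with a short sketch. It assumes the exponential function as the strictly positive transformation of the logits, and, for the evidential part, the subjective-logic convention that the concentration is the evidence plus one with uncertainty mass u = K / S, where S is the sum of the concentrations (an assumption about the parameterization, stated here for illustration). The function names are hypothetical.

```python
import numpy as np

def dirichlet_from_logits(z):
    """Map logits to Dirichlet concentrations via the exponential function."""
    alpha = np.exp(z)                  # strictly positive concentrations
    alpha0 = alpha.sum()               # precision of the Dirichlet distribution
    expected_probs = alpha / alpha0    # mean of the Dirichlet (equals the softmax here)
    return alpha, alpha0, expected_probs

def evidential_uncertainty(evidence):
    """Uncertainty mass u = K / S, assuming alpha = evidence + 1."""
    alpha = np.asarray(evidence, dtype=float) + 1.0
    return len(alpha) / alpha.sum()

# Two inputs with the same (uniform) expected class probabilities:
_, prec_unfamiliar, probs_unfamiliar = dirichlet_from_logits(np.array([0.0, 0.0, 0.0]))
_, prec_ambiguous, probs_ambiguous = dirichlet_from_logits(np.array([4.0, 4.0, 4.0]))

print(probs_unfamiliar, prec_unfamiliar)   # low precision: flat Dirichlet (model uncertainty)
print(probs_ambiguous, prec_ambiguous)     # high precision: sharp, centered Dirichlet (data uncertainty)
print(evidential_uncertainty([0.0, 0.0, 0.0]))    # no evidence at all
print(evidential_uncertainty([10.0, 0.0, 0.0]))   # evidence for one class
```

Both logit vectors yield the same categorical prediction, so only the precision separates the two situations, which is the distinction a plain softmax output cannot express.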
Besides the Dirichlet distribution based approaches described above, several other internal approaches exist. In [68], a relatively simple approach based on small perturbations of the training input data and the temperature scaling calibration is presented, leading to an efficient differentiation of in- and out-of-distribution samples. Możejko et al. [92] made use of the inhibited softmax function. It contains an artificial and constant logit that makes the absolute magnitude of the single logits more determining in the softmax output. Van Amersfoort et al. [35] showed that Radial Basis Function (RBF) networks can be used to achieve competitive results in accuracy and very good results regarding uncertainty estimation. RBF networks learn a linear transformation on the logits and classify inputs based on the distance between the transformed logits and the learned class centroids. In [35], a scaled exponentiated $L_2$ distance was used. The data uncertainty can be directly derived from the distances between the centroids. By including penalties on the Jacobian matrix in the loss function, the network was trained to be more sensitive to changes in the input space. As a result, the method reached good performance on out-of-distribution detection. In several tests, the approach was compared to a five-member deep ensemble [31] and it was shown that this single network approach performs at least equivalently well on detecting out-of-distribution samples and improves the true-positive rate. For regression tasks, Oala et al.
[93] introduced an uncertainty score based on the lower and upper bound output of an interval neural network. The interval neural network has the same structure as the underlying deterministic neural network and is initialized with the deterministic network's weights. In contrast to Gaussian representations of uncertainty given by a standard deviation, this approach can give non-symmetric values of uncertainty. Furthermore, the approach is found to be more robust in the presence of noise. Tagasovska and Lopez-Paz [103] presented an approach to estimate data and model uncertainty. A simultaneous quantile regression loss function was introduced in order to generate well-calibrated prediction intervals for the data uncertainty. The model uncertainty is quantified based on a mapping from the training data to zero, based on so-called Orthonormal Certificates. The aim was that out-of-distribution samples, where the model is uncertain, are mapped to a non-zero value and thus can be recognized. Kawashima et al. [104] introduced a method which computes virtual residuals in the training samples of a regression task based on a cross-validation-like pre-training step. With the original training data expanded by the information of these residuals, the actual predictor is trained to give a prediction and a value of certainty. The experiments indicated that the virtual residuals represent a promising tool for avoiding overconfident network predictions.
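Quantile regression losses of the kind mentioned above are commonly built on the pinball (quantile) loss. The following sketch is an illustration under simplifying assumptions, not the training procedure of [103]: instead of a network, it fits a constant prediction by grid search on toy targets and shows that the minimizer of the pinball loss at level tau is the corresponding empirical quantile, so two levels give a prediction interval. The names `pinball_loss` and `best_constant` are hypothetical.

```python
import numpy as np

def pinball_loss(y, y_pred, tau):
    """Pinball (quantile) loss for a quantile level tau in (0, 1)."""
    diff = y - y_pred
    # Under-prediction is weighted by tau, over-prediction by (1 - tau).
    return np.mean(np.maximum(tau * diff, (tau - 1.0) * diff))

def best_constant(y, tau, candidates):
    """Constant prediction that minimizes the pinball loss over a grid."""
    losses = [pinball_loss(y, c, tau) for c in candidates]
    return candidates[int(np.argmin(losses))]

y = np.arange(1.0, 101.0)                  # toy targets 1, 2, ..., 100
candidates = np.arange(0.0, 101.5, 0.5)    # grid of constant predictions

lower = best_constant(y, 0.05, candidates)
upper = best_constant(y, 0.95, candidates)
print(f"90% prediction interval: [{lower}, {upper}]")
```

Because the loss is asymmetric in the sign of the residual, intervals obtained this way do not have to be symmetric around a point prediction.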
2) External Uncertainty Quantification Approaches: External uncertainty quantification approaches do not affect the models' predictions, since the evaluation of the uncertainty is separated from the underlying prediction task. Furthermore, several external approaches can be applied to already trained networks at the same time without affecting each other. Raghu et al. [46] argued that when both tasks, the prediction and the uncertainty quantification, are done by one single method, the uncertainty estimation is biased by the actual prediction task. Therefore, they recommended a "direct uncertainty prediction" and suggested training two neural networks, one for the actual prediction task and a second one for the prediction of the uncertainty on the first network's predictions. Similarly, Ramalho and Miranda [36] introduced an additional neural network for uncertainty estimation. But in contrast to [46], the representation space of the training data is considered and the density around a given test sample is evaluated. The additional neural network uses this training data density in order to predict whether the main network's estimate is expected to be correct or false. Hsu et al. [105] detected out-of-distribution examples in classification tasks at test-time by predicting total probabilities for each class, in addition to the categorical distribution given by the softmax output. The class-wise total probability is predicted by applying the sigmoid function to the network's logits. Based on these total probabilities, OOD examples can be identified as those with low class probabilities for all classes. In contrast to this, Oberdiek et al. [71] took the sensitivity of the model, i.e. the model's slope, into account by using gradient metrics for the uncertainty quantification in classification tasks. Lee et al.
[72] applied a similar idea but made use of backpropagated gradients. In their work they presented state-of-the-art results on out-of-distribution and corrupted input detection.
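The idea of class-wise total probabilities described above can be sketched in a few lines. This is only an illustration of the principle as summarized in this section, not the method of [105]; the function `is_ood` and the threshold value are hypothetical choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def is_ood(logits, threshold=0.5):
    """Flag a sample as OOD if no class reaches the (hypothetical) threshold."""
    total_probs = sigmoid(np.asarray(logits, dtype=float))
    return bool(total_probs.max() < threshold)

in_dist_logits = np.array([4.0, -3.0, -2.0])   # one class clearly supported
ood_logits = np.array([-3.0, -2.5, -2.8])      # no class supported

# The softmax output always sums to one, so it cannot express "none of the classes".
print(softmax(ood_logits), is_ood(ood_logits))
print(softmax(in_dist_logits), is_ood(in_dist_logits))
```

In the second example the softmax still distributes the full probability mass over the three classes, while all sigmoid outputs are small.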
3) Summing Up Single Deterministic Methods: Compared to many other principles, single deterministic methods are computationally efficient in training and evaluation. For training, only one network has to be trained and often the approaches can even be applied on pre-trained networks. Depending on the actual approach, only a single or at most two forward passes have to be performed for evaluation. The underlying networks could contain more complex loss functions, which slow down the training process [44], or external components that have to be trained and evaluated additionally [46]. But in general, this is still more efficient than the number of predictions needed for ensemble-based methods (Section III-C), Bayesian methods (Section III-B), and test-time data augmentation methods (Section III-D). A drawback of single deterministic neural network approaches is the fact that they rely on a single opinion and can therefore become very sensitive to the underlying network architecture, training procedure, and training data.

B. Bayesian Neural Networks
Bayesian Neural Networks (BNNs) [106], [107], [108] have the ability to combine the scalability, expressiveness, and predictive performance of neural networks with Bayesian learning, as opposed to learning via the maximum likelihood principle. This is achieved by inferring the probability distribution over the network parameters $\theta = (w_1, \dots, w_K)$.
More specifically, given a training input-target pair (x, y), the posterior distribution over the space of parameters p(θ|x, y) is modelled by assuming a prior distribution over the parameters p(θ) and applying Bayes' theorem:
$$p(\theta \mid x, y) = \frac{p(y \mid x, \theta)\, p(\theta)}{p(y \mid x)} \propto p(y \mid x, \theta)\, p(\theta). \tag{16}$$
Here, the normalization constant in (16) is called the model evidence p(y|x), which is defined as
$$p(y \mid x) = \int p(y \mid x, \theta)\, p(\theta)\, d\theta. \tag{17}$$
Once the posterior distribution over the weights has been estimated, the prediction of an output $y^*$ for a new input $x^*$ can be obtained by Bayesian Model Averaging or Full Bayesian Analysis, which involves marginalizing the likelihood p(y|x, θ) with the posterior distribution:
$$p(y^* \mid x^*, x, y) = \int p(y^* \mid x^*, \theta)\, p(\theta \mid x, y)\, d\theta. \tag{18}$$
This Bayesian way of prediction is a direct application of the law of total probability and endows the ability to compute the principled predictive uncertainty. The integral in (18) is intractable for the most common prior-posterior pairs, and approximation techniques are therefore typically applied. The most widespread approximation, the Monte Carlo approximation, follows the law of large numbers and approximates the expected value by the mean of N deterministic networks, $f_{\theta_1}, \dots, f_{\theta_N}$, parameterized by N samples, $\theta_1, \theta_2, \dots, \theta_N$, from the posterior distribution of the weights, i.e.
$$y^* \approx \frac{1}{N} \sum_{i=1}^{N} y_i^* = \frac{1}{N} \sum_{i=1}^{N} f_{\theta_i}(x^*). \tag{19}$$
Wilson and Izmailov [16] argue that a key advantage of BNNs lies in this marginalization step, which in particular can improve both the accuracy and the calibration of modern deep neural networks. We note that the use-cases of BNNs are not limited to uncertainty estimation but open up the possibility of using the powerful Bayesian tool-boxes within deep learning.
Notable examples include Bayesian model selection [109], [110], [111], [112], model compression [76], [113], [114], active learning [115], [23], [116], continual learning [117], [118], [119], [120], theoretic advances in Bayesian learning [121] and beyond. While the formulation is rather simple, there exist several challenges. For example, no closed form solution exists for the posterior inference, as conjugate priors do not typically exist for complex models such as neural networks [62]. Hence, approximate Bayesian inference techniques are often needed to compute the posterior probabilities. Yet, directly using approximate Bayesian inference techniques has proven to be difficult, as the size of the data and the number of parameters are too large for the use-cases of deep neural networks. In other words, the integrals in the above equations are not computationally tractable as the size of the data and the number of parameters grow. Moreover, specifying a meaningful prior for deep neural networks is another challenge that is less understood.
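The Monte Carlo approximation of the Bayesian model average described above can be illustrated on a model where the posterior is available in closed form. The sketch below assumes a toy one-parameter linear regression with a Gaussian prior and Gaussian noise (a conjugate case chosen only for illustration; it is not one of the neural network models discussed in this survey), samples parameters from the exact posterior, and averages the resulting deterministic predictors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data from y = 2x + noise; model f_theta(x) = theta * x with a single weight.
x = rng.uniform(-1.0, 1.0, size=50)
y = 2.0 * x + rng.normal(scale=0.1, size=50)

noise_var, prior_var = 0.1 ** 2, 1.0
# Conjugate Gaussian prior and likelihood give a closed-form Gaussian posterior.
post_var = 1.0 / (1.0 / prior_var + (x @ x) / noise_var)
post_mean = post_var * (x @ y) / noise_var

# Monte Carlo approximation: average N deterministic models f_{theta_i}(x*).
N = 10_000
theta_samples = rng.normal(post_mean, np.sqrt(post_var), size=N)
x_star = 0.5
preds = theta_samples * x_star
mc_mean = preds.mean()   # approximates the Bayesian model average
mc_var = preds.var()     # spread caused by parameter (model) uncertainty

print(mc_mean, post_mean * x_star)
print(mc_var, post_var * x_star ** 2)
```

For neural networks the posterior samples are not available exactly, which is why the approximate inference schemes of the following subsections are needed; the averaging step itself stays the same.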
In this survey, we classify BNNs into three different types based on how the posterior distribution is inferred to approximate Bayesian inference:
• Variational inference [73], [74]: Variational inference approaches approximate the (in general intractable) posterior distribution by optimizing over a family of tractable distributions.
• Sampling approaches [78]: Sampling approaches deliver a representation of the target random variable from which realizations can be sampled. Such methods are based on Markov Chain Monte Carlo and further extensions.
• Laplace approximation [122], [123]: Laplace approximation approaches approximate the posterior distribution around a local mode of the loss surface with a Multivariate Normal distribution.
1) Variational Inference: The goal of variational inference is to infer the posterior probabilities p(θ|x, y) using a pre-specified family of distributions q(θ). Here, this so-called variational family q(θ) is defined as a parametric distribution. An example is the Multivariate Normal distribution, whose parameters are the mean and the covariance matrix. The main idea of variational inference is to find the settings of these parameters that make q(θ) close to the posterior of interest p(θ|x, y). This measure of closeness between the probability distributions is given by the Kullback-Leibler (KL) divergence
$$\mathrm{KL}(q \,\|\, p) = \mathbb{E}_q\!\left[\log \frac{q(\theta)}{p(\theta \mid x, y)}\right]. \tag{20}$$
Because it contains the intractable posterior p(θ|x, y), the KL divergence in (20) cannot be minimized directly. Instead, the evidence lower bound (ELBO), whose negative equals the KL divergence up to an additive constant, is maximized. For a given prior distribution on the parameters p(θ), the ELBO is given by
$$L = \mathbb{E}_q\!\left[\log \frac{p(y \mid x, \theta)\, p(\theta)}{q(\theta)}\right] = \mathbb{E}_q\!\left[\log p(y \mid x, \theta)\right] - \mathrm{KL}\big(q(\theta) \,\|\, p(\theta)\big),$$
and for the KL divergence
$$\mathrm{KL}(q \,\|\, p) = -L + \log p(y \mid x)$$
holds.
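The relation between the ELBO, the model evidence, and the KL divergence can be checked numerically on a model where everything is available in closed form. The sketch below assumes a toy model with prior θ ~ N(0, 1), likelihood y | θ ~ N(θ, 1), and a Gaussian variational family q = N(m, s2); it uses the standard decomposition of the ELBO into the expected log-likelihood minus the KL divergence between q and the prior. The function name `elbo` and the toy model are illustrative assumptions.

```python
import numpy as np

def elbo(m, s2, y):
    """ELBO of q = N(m, s2) for prior N(0, 1) and likelihood y | theta ~ N(theta, 1)."""
    # E_q[log p(y | theta)] for a Gaussian q, in closed form.
    expected_log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * ((y - m) ** 2 + s2)
    # KL(q || prior) between N(m, s2) and N(0, 1).
    kl_to_prior = 0.5 * (s2 + m ** 2 - 1.0 - np.log(s2))
    return expected_log_lik - kl_to_prior

y = 1.0
# Marginally y ~ N(0, 2), which gives the exact log evidence.
log_evidence = -0.5 * np.log(2 * np.pi * 2.0) - y ** 2 / 4.0

elbo_exact = elbo(m=y / 2.0, s2=0.5, y=y)   # q equals the true posterior N(y/2, 1/2)
elbo_poor = elbo(m=0.0, s2=1.0, y=y)        # q equals the prior
gap = log_evidence - elbo_poor              # equals KL(q || posterior) >= 0

print(log_evidence, elbo_exact, elbo_poor, gap)
```

The bound is tight exactly when q matches the posterior, and otherwise the gap is the KL divergence that variational inference implicitly minimizes.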
Variational methods for BNNs were pioneered by Hinton and Van Camp [73], who derived a diagonal Gaussian approximation to the posterior distribution of neural networks (couched in information theory as a minimum description length formulation). Another notable extension in the 1990s was proposed by Barber and Bishop [74], in which the full covariance matrix was chosen as the variational family, and the authors demonstrated how the ELBO can be optimized for neural networks. Several modern approaches can be viewed as extensions of these early works [73], [74] with a focus on how to scale variational inference to modern neural networks. A clear direction in current methods is the use of stochastic variational inference (or Monte-Carlo variational inference), where the optimization of the ELBO is performed using mini-batches of data. One of the first connections to stochastic variational inference was proposed by Graves et al. [75] with Gaussian priors. In 2015, Blundell et al. [30] introduced Bayes By Backprop, a further extension of stochastic variational inference [75] to non-Gaussian priors, and demonstrated how the stochastic gradients can be made unbiased. Notably, Kingma et al. [142] introduced the local reparameterization trick to reduce the variance of the stochastic gradients. One of the key concepts is to reformulate the loss function of the neural network as the ELBO. As a result, the intractable posterior distribution is indirectly optimized and variational inference is compatible with back-propagation, with certain modifications to the training procedure. These extensions largely address the fragility of stochastic variational inference that arises due to sensitivity to initialization, prior definition, and variance of the gradients. These limitations have been addressed recently by Wu et al.
[143], where a hierarchical prior was used and the moments of the variational distribution are approximated deterministically. The above works commonly assumed mean-field approximations as the variational family, neglecting the correlations between the parameters. In order to make more expressive variational distributions feasible for deep neural networks, several works proposed to infer using the matrix normal distribution [144], [145], [146] or more expressive variants [147], [148], where the covariance matrix is decomposed into the Kronecker products of smaller matrices or into a low rank form plus a positive diagonal matrix. A notable contribution towards expressive posterior distributions has been the use of normalizing flows [77], [149], a hierarchical probability distribution where a sequence of invertible transformations is applied so that a simple initial density function is transformed into a more complex distribution. Interestingly, Farquhar et al. [137] argue that the mean-field approximation is not a restrictive assumption, and that the layer-wise weight correlations may not be as important as capturing the depth-wise correlations. While the claim of Farquhar et al. [137] remains an open question, the mean-field approximations have the advantage of lower computational complexity [137]. For example, Osawa et al.
[150] demonstrated that variational inference can be scaled up to ImageNet size data-sets and architectures using multiple GPUs and proposed practical tricks such as data augmentation, momentum initialization, and learning rate scheduling. One of the successes in variational methods has been accomplished by casting existing stochastic elements of deep learning as variational inference. A widely known example is Monte Carlo Dropout (MC Dropout), where the dropout layers are formulated as Bernoulli distributed random variables, and training a neural network with dropout layers can be approximated as performing variational inference [61], [20], [151]. A main advantage of MC dropout is that the predictive uncertainty can be computed by activating dropout not only during training, but also at test time. In this way, once the neural network is trained with dropout layers, the implementation effort can be kept to a minimum and practitioners do not need expert knowledge to reason about uncertainty; the authors attribute the success of the method to these criteria [20]. The practical value of this method has also been demonstrated in several works [152], [10], [21] and resulted in different extensions (evaluating the usage of different dropout masks, for example for convolutional layers [153], or changing the representations of the predictive uncertainty into model and data uncertainties [60]). Approaches that build upon a similar idea but randomly drop incoming activations of a node, instead of dropping an activation for all following nodes, were also proposed within the literature [37] and called drop connect. This was found to be more robust on the uncertainty representation, even though it was shown that a combination of both can lead to higher accuracy and robustness in the test predictions [154]. Lastly, connections of variational inference to Adam [155], RMS Prop [156], and batch normalization [157] have been further suggested in the literature.
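The test-time use of dropout described above can be sketched in plain NumPy. The weights below are random stand-ins for a network that would have been trained with dropout (an assumption made to keep the example self-contained), and the dropout rate and number of passes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical weights standing in for a classifier trained with dropout.
W1 = rng.normal(size=(32, 4))
W2 = rng.normal(size=(3, 32))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_with_dropout(x, p_drop=0.5):
    h = np.maximum(W1 @ x, 0.0)
    mask = rng.random(h.shape) >= p_drop   # Bernoulli keep mask, resampled on every pass
    h = h * mask / (1.0 - p_drop)          # inverted dropout scaling
    return softmax(W2 @ h)

x = rng.normal(size=4)
T = 100                                    # number of stochastic forward passes
mc_samples = np.stack([forward_with_dropout(x) for _ in range(T)])   # shape (T, 3)
mean_probs = mc_samples.mean(axis=0)       # Monte Carlo estimate of the predictive distribution
std_probs = mc_samples.std(axis=0)         # spread across the stochastic passes

print(mean_probs, std_probs)
```

The only change compared to a standard forward pass is that the dropout mask stays active, which is why the method requires so little implementation effort.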
2) Sampling Methods: Sampling methods, often also called Monte Carlo methods, are another family of Bayesian inference algorithms that represent uncertainty without a parametric model. Specifically, sampling methods use a set of hypotheses (or samples) drawn from the distribution and offer the advantage that the representation itself is not restricted by the type of distribution (e.g. it can be multi-modal or non-Gaussian); hence, probability distributions are obtained nonparametrically. Popular algorithms within this domain are particle filtering, rejection sampling, importance sampling, and Markov Chain Monte Carlo sampling (MCMC) [62]. In the case of neural networks, MCMC is often used, since alternatives such as rejection and importance sampling are known to be inefficient for such high dimensional problems. The main idea of MCMC is to sample from arbitrary distributions by transitions in state space, where each transition is governed by a record of the current state and the proposal distribution that aims to estimate the target distribution (e.g. the true posterior). To explain this, let us start by defining the Markov Chain: a Markov Chain is a distribution over random variables $x_1, \dots, x_T$ which follows the state transition rule
$$p(x_{t+1} \mid x_1, \dots, x_t) = p(x_{t+1} \mid x_t),$$
i.e. the next state only depends on the current state and not on any other former state. In order to draw samples from the true posterior, MCMC sampling methods first generate samples iteratively, in a Markov Chain fashion. Then, at each iteration, the algorithm decides to either accept or reject the samples, where the probability of acceptance is determined by certain rules. In this way, as more and more samples are produced, their values can approximate the desired distribution.
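The propose-then-accept-or-reject scheme described above can be sketched with a random-walk Metropolis sampler. The example assumes a one-dimensional standard normal target given only through its unnormalized log density and a Gaussian random-walk proposal; both are illustrative choices, and the function names are hypothetical.

```python
import numpy as np

def log_target(x):
    return -0.5 * x ** 2          # unnormalized log density of N(0, 1)

def metropolis(n_samples, step=2.0, seed=0):
    rng = np.random.default_rng(seed)
    x = 0.0
    chain = np.empty(n_samples)
    for t in range(n_samples):
        # The proposal depends only on the current state (Markov property).
        proposal = x + step * rng.normal()
        log_accept = log_target(proposal) - log_target(x)
        # Accept with probability min(1, target(proposal) / target(x)).
        if np.log(rng.random()) < log_accept:
            x = proposal
        chain[t] = x              # on rejection the current state is repeated
    return chain

chain = metropolis(20_000)
print(chain.mean(), chain.std())  # should be close to 0 and 1
```

Only ratios of the target density enter the acceptance rule, so the normalization constant (the model evidence in the Bayesian setting) is never needed.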
Hamiltonian Monte Carlo or Hybrid Monte Carlo (HMC) [158] is an important variant of the MCMC sampling method (pioneered by Neal [78], [79], [80], [159] for neural networks), and is often known as the gold standard of Bayesian inference [159], [160], [125]. The algorithm works as follows: (i) start by initializing a set of parameters θ (either randomly or in a user-specific manner). Then, for a given number of total iterations, (ii) instead of a random walk, a momentum vector, an auxiliary variable ρ, is sampled, and the current value of the parameters θ is updated via the Hamiltonian dynamics:
$$H(\rho, \theta) = -\log p(\rho, \theta) = -\log p(\rho \mid \theta) - \log p(\theta).$$
Defining the potential energy $V(\theta) = -\log p(\theta)$ and the kinetic energy $T(\rho \mid \theta) = -\log p(\rho \mid \theta)$, the update steps via Hamilton's equations are governed by
$$\frac{d\theta}{dt} = \frac{\partial H}{\partial \rho} = \frac{\partial T}{\partial \rho}, \qquad \frac{d\rho}{dt} = -\frac{\partial H}{\partial \theta} = -\frac{\partial T}{\partial \theta} - \frac{\partial V}{\partial \theta}.$$
The so-called leapfrog integrator is used as a solver [161]. (iii) For each step, a Metropolis acceptance criterion is applied to either reject or accept the samples (as in standard MCMC). Unfortunately, HMC requires the processing of the entire data-set per iteration, which is computationally too expensive when the data-set size grows to millions or even billions. Hence, many modern algorithms focus on how to perform the computations stochastically in a mini-batch fashion. In this context, Welling and Teh [81] were the first to propose combining Stochastic Gradient Descent (SGD) with Langevin dynamics (a form of MCMC [162], [163], [159]) in order to obtain a scalable approximation to the MCMC algorithm based on mini-batch SGD [164], [165]. The work demonstrated that performing Bayesian inference on Deep Neural Networks can be as simple as running a noisy SGD. By using first-order Langevin dynamics, this method does not include the momentum term of HMC, and it opened up a new research area on Stochastic Gradient Markov Chain Monte Carlo (SG-MCMC). Consequently, several extensions are available, which include the use of 2nd order information such as preconditioning and optimizing with the Fisher Information Matrix (FIM) [166], [167], [168], the
Hessian [169], [170], [171], adapting a preconditioning diagonal matrix [172], generating samples from non-isotropic target densities using Fisher scoring [173], and samplers in the Riemannian manifold [174] using the first order Langevin dynamics and Levy diffusion noise and momentum [175]. Within these methods, the so-called parameter dependent diffusion matrices are incorporated with the intention to offset the stochastic perturbation of the gradient. To do so, the "thermostat" ideas [176], [177], [178] are proposed so that a prescribed constant temperature distribution is maintained with the parameter dependent noise. Ahn et al. [179] devised a distributed computing system for SG-MCMC to exploit modern computing routines, while Wang et al. [180] showed that Generative Adversarial Models (GANs) can be used to distill the samples for improved memory efficiency, instead of distillation for enhancing the run-time capabilities of computing predictive uncertainty [181]. Lastly, other recent trends are techniques that reduce the variance [160], [182] and bias [183], [184] arising from stochastic gradients. Concurrently, there have been solid advances in the theory of SG-MCMC methods and their applications in practice. Sato and Nakagawa [185], for the first time, showed that the SGLD algorithm with constant step size weakly converges; Chen et al. [186] showed that faster convergence rates and more accurate invariant measures can be observed for SG-MCMCs with higher order integrators rather than a 1st order Euler integrator, while Teh et al.
[187] studied the consistency and fluctuation properties of the SGLD.As a result, verifiable conditions obeying a central limit theorem for which the algorithm is consistent, and how its asymptotic bias-variance decomposition depends on step-size sequences have been discovered.A more detailed review of the SG-MCMC with a focus on supporting theoretical results can be found in Nemeth and Fearnhead [82].Practically, SG-MCMC techniques have been applied to shape classification and uncertainty quantification [188], empirically study and validate the effects of tempered posteriors (or called coldposteriors) [189] and train a deep neural network in order to generalize and avoid over-fitting [190], [191].
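To make steps (i) to (iii) of HMC concrete, the following is a minimal sketch for a scalar parameter. It assumes a standard normal momentum distribution, so that the kinetic energy is T(ρ) = ρ²/2 up to a constant. The function name `hmc_sample` and all of its arguments are hypothetical and chosen only for illustration; this is not an implementation taken from any of the cited works.

```python
import math
import random


def hmc_sample(potential, grad_potential, theta0, n_samples=2000,
               step_size=0.2, n_leapfrog=10, seed=0):
    """Sample a scalar theta from p(theta) proportional to exp(-V(theta)).

    potential      -- V(theta) = -log p(theta), up to an additive constant
    grad_potential -- dV/dtheta
    Assumption: the momentum rho is drawn from N(0, 1), so T(rho) = rho^2 / 2.
    """
    rng = random.Random(seed)
    theta = theta0  # (i) initialization of the parameter
    samples = []
    for _ in range(n_samples):
        # (ii) sample the auxiliary momentum variable
        rho = rng.gauss(0.0, 1.0)
        theta_new, rho_new = theta, rho

        # Leapfrog integration of Hamilton's equations:
        # half step in momentum, alternating full steps, final half step.
        rho_new -= 0.5 * step_size * grad_potential(theta_new)
        for step in range(n_leapfrog):
            theta_new += step_size * rho_new
            if step < n_leapfrog - 1:
                rho_new -= step_size * grad_potential(theta_new)
        rho_new -= 0.5 * step_size * grad_potential(theta_new)

        # (iii) Metropolis acceptance based on the change in total energy
        h_old = potential(theta) + 0.5 * rho * rho
        h_new = potential(theta_new) + 0.5 * rho_new * rho_new
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            theta = theta_new
        samples.append(theta)
    return samples
```

For a standard normal target, one would call `hmc_sample(lambda t: 0.5 * t * t, lambda t: t, theta0=0.0)`; the sample mean and variance should then be close to 0 and 1. Note that every gradient evaluation here touches the full target, which is exactly the cost that the SG-MCMC methods discussed above avoid through mini-batching.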
3) Laplace Approximation: The goal of the Laplace Approximation is to estimate the posterior distribution over the parameters of neural networks p(θ | x, y) around a local mode of the loss surface with a Multivariate Normal distribution. The Laplace Approximation to the posterior can be obtained by taking the second-order Taylor series expansion of the log posterior over the weights around the MAP estimate θ̂ given some data (x, y). If we assume a Gaussian prior with a scalar precision value τ > 0, then this corresponds to the commonly used L2-regularization, and the Taylor series expansion results in

$$\log p(\theta \mid x, y) \approx \log p(\hat{\theta} \mid x, y) - \frac{1}{2} (\theta - \hat{\theta})^T (H + \tau I) (\theta - \hat{\theta}),$$

where the first-order term vanishes because the gradient of the log posterior δθ = ∇ log p(θ | x, y) is zero at the maximum θ̂.
Taking the exponential on both sides and approximating integrals by reverse engineering densities, the weight posterior is approximately a Gaussian with mean θ̂ and covariance matrix (H + τI)^{-1}, where H is the Hessian of the negative log-likelihood −log p(y | x, θ) evaluated at θ̂, so that H + τI is the Hessian of the negative log posterior.
This means that the model uncertainty is represented by the Hessian H, resulting in a Multivariate Normal distribution:

$$p(\theta \mid x, y) \approx \mathcal{N}\left(\hat{\theta}, (H + \tau I)^{-1}\right).$$

In contrast to the two other methods described, the Laplace Approximation can be applied to already trained networks, and is generally applicable when using standard loss functions such as MSE or cross entropy and piece-wise linear activations (e.g. ReLU). MacKay [123] and Denker et al. [122] pioneered the Laplace Approximation for neural networks in the 1990s, and several modern methods provide an extension to deep neural networks [192], [193], [63], [84].
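As a small worked illustration of this construction, the sketch below applies the Laplace Approximation to a logistic regression model with a single scalar weight. It assumes a zero-mean Gaussian prior with precision τ, finds the MAP estimate with Newton's method, and returns the Gaussian variance as the inverse of the curvature of the negative log posterior (likelihood curvature plus τ). The model, the function names, and the toy data in the usage note are hypothetical and not taken from the cited works.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def laplace_logistic(xs, ys, tau=1.0, n_newton=50):
    """Laplace Approximation for p(y=1 | x, theta) = sigmoid(theta * x).

    Returns (theta_map, variance), where variance = 1 / (H + tau) and H is
    the second derivative of the negative log-likelihood at theta_map.
    """
    theta = 0.0
    for _ in range(n_newton):
        grad = tau * theta  # gradient contribution of the Gaussian prior
        hess_lik = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(theta * x)
            grad += (p - y) * x               # gradient of the neg. log-likelihood
            hess_lik += p * (1.0 - p) * x * x  # its curvature
        theta -= grad / (hess_lik + tau)       # Newton step towards the MAP

    # Curvature of the negative log-likelihood at the MAP estimate
    hess_lik = sum(
        sigmoid(theta * x) * (1.0 - sigmoid(theta * x)) * x * x for x in xs
    )
    variance = 1.0 / (hess_lik + tau)
    return theta, variance
```

With, for example, `xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]` and `ys = [0, 0, 1, 0, 1, 1]`, the returned variance is always smaller than the prior variance 1/τ, reflecting that the data sharpen the posterior around the mode. For deep networks the scalar curvature becomes the full Hessian, which is why the approximations discussed next are needed.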
The core of the Laplace Approximation is the estimation of the Hessian. Unfortunately, due to the enormous number of parameters in modern neural networks, the Hessian matrices cannot be computed in a feasible way, as opposed to the relatively smaller networks in MacKay [123] and Denker et al. [122]. Consequently, several different ways of approximating H have been proposed in the literature. A brief review is as follows. Instead of diagonal approximations (e.g. [194], [83]), several researchers have focused on including the off-diagonal elements (e.g. [195], [196] and [197]). Amongst them, the layer-wise Kronecker factor approximations of [198], [193], [192] and [199] have demonstrated notable scalability [200]. A recent extension can be found in [201], where the authors propose to re-scale the eigenvalues of the Kronecker-factored matrices so that the diagonal variance in its eigenbasis is accurate. The work presents an interesting idea, as one can prove that in terms of the Frobenius norm, the proposed approximation is more accurate than that of [193]. However, as this approximation is harmed by inaccurate estimates of eigenvectors, Lee et al. [84] proposed to further correct the diagonal elements in the parameter space.
Existing works obtain the Laplace Approximation using various approximations of the Hessian along a fidelity-complexity trade-off. In several works, an approximation using the diagonal of the Fisher information matrix or the Gauss-Newton matrix, leading to independently distributed model weights, has been utilized in order to prune weights [202] or to perform continual learning while avoiding catastrophic forgetting [203]. In Ritter et al. [63], the Kronecker factorization of the approximate block-diagonal Hessian [193], [192] has been applied to obtain a scalable Laplace Approximation for neural networks. With this, the weights of different layers are still assumed to be independently distributed, but correlations within the same layer are modeled. Recently, building upon the current understanding of neural networks' loss landscapes that many eigenvalues of the Hessian tend to be zero, [84] developed a low-rank approximation that leads to sparse representations of the layers' covariance matrices. Furthermore, Lee et al. [84] demonstrated that the Laplace Approximation can be scaled to ImageNet-size data sets and architectures, and further showed that with the proposed sparsification technique, the memory complexity of modelling correlations can be made similar to that of the diagonal approximation. Lastly, Kristiadi et al. [204] proposed a simple procedure to compute the last-layer Gaussian approximation (neglecting the model uncertainty in all other layers of the neural network), and showed that even such a minimalist solution can mitigate the overconfident predictions of ReLU networks.
Recent efforts have extended the Laplace Approximation beyond the Hessian approximation. To address the well-known limitation that the Laplace Approximation assumes a bell-shaped true posterior and can therefore result in under-fitting behavior [63], Humt et al. [205] proposed to use Bayesian Optimization and showed that the hyperparameters of the Laplace Approximation can be efficiently optimized with increased calibration performance. Another work in this domain is by Kristiadi et al. [206], who proposed uncertainty units, a new type of hidden unit that changes the geometry of the loss landscape so that more accurate inference is possible. While Shinde et al. [207] demonstrated the practical effectiveness of the Laplace Approximation in autonomous driving applications, Feng et al. [208] showed the possibility to (i) incorporate contextual information and (ii) perform domain adaptation in a semi-supervised manner within the context of image classification. This is achieved by designing unary potentials within a Conditional Random Field. Several real-time methods also exist that do not require multiple forward passes to compute the predictive uncertainty. The so-called linearized Laplace Approximation has been proposed in [209], [210] using the ideas of MacKay [115] and has been extended with the Laplace bridge for classification [211]. Within this framework, Daxberger et al. [212] proposed inferring sub-networks to increase the expressivity of covariance propagation while remaining computationally tractable.
4) Sum Up Bayesian Methods: Bayesian methods for deep learning have emerged as a strong research domain by combining principled Bayesian learning with deep neural networks. A review of current BNNs has been provided, with a focus mostly on how the posterior p(θ | x, y) is inferred. As an observation, many of the recent breakthroughs have been achieved by performing approximate Bayesian inference stochastically in a mini-batch fashion or by investigating relatively simple but scalable techniques such as MC-dropout or the Laplace Approximation. As a result, several works demonstrated that posterior inference in large scale settings is now possible [213], [150], [84], and the field has several practical approximation tools to compute more expressive and accurate posteriors since the revival of BNNs beyond the pioneers [73], [74], [78], [122], [123]. There are also emerging challenges on new frontiers beyond accurate inference techniques. Some examples are: (i) how to specify meaningful priors [134], [135], (ii) how to efficiently marginalize over the parameters for fast predictive uncertainty [181], [138], [211], (iii) infrastructure such as new benchmarks, evaluation protocols and software tools [214], [131], [132], [215], and (iv) a better understanding of the current methodologies and their potential applications [137], [189], [216], [208].

C. Ensemble Methods
1) Principles of Ensemble Methods: Ensembles derive a prediction based on the predictions received from multiple so-called ensemble members. They aim at better generalization by making use of synergy effects among the different models, arguing that a group of decision makers tends to make better decisions than a single decision maker [217], [218]. For an ensemble f : X → Y with members f_i : X → Y for i ∈ {1, 2, ..., M}, this could for example be implemented by simply averaging over the members' predictions,

$$f(x) := \frac{1}{M} \sum_{i=1}^{M} f_i(x).$$

Based on this intuitive idea, several works applying ensemble methods to different kinds of practical tasks and approaches, for example bioinformatics [219], [220], [221], remote sensing [222], [223], [224], or reinforcement learning [225], [226], can be found in the literature. Besides the improvement in accuracy, ensembles give an intuitive way of representing the model uncertainty on a prediction by evaluating the variety among the members' predictions.
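The averaging rule and the idea of reading model uncertainty from the spread of the members' predictions can be sketched in a few lines. The example assumes scalar-valued members and uses the (biased) variance across members as a simple disagreement measure; the function name `ensemble_predict` is hypothetical.

```python
def ensemble_predict(members, x):
    """Average the predictions of all members and measure their disagreement.

    members -- list of callables f_i mapping an input x to a scalar prediction
    Returns (mean prediction, variance across the members' predictions).
    """
    predictions = [f(x) for f in members]
    m = len(predictions)
    mean = sum(predictions) / m
    # Spread of the members around the ensemble mean, used here as a
    # simple proxy for model uncertainty.
    variance = sum((p - mean) ** 2 for p in predictions) / m
    return mean, variance
```

For three toy members `lambda x: 1.0 * x`, `lambda x: 2.0 * x`, and `lambda x: 3.0 * x` evaluated at x = 2, the ensemble prediction is 4 and the variance across members is 8/3. A variance of zero would indicate that all members agree.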
Compared to Bayesian and single deterministic network approaches, ensemble methods have two major differences. First, the general idea behind ensembles is relatively clear, and there are not many groundbreaking differences between the different types of ensemble methods or their application in different fields. Hence, this section focuses on different strategies to train an ensemble and some variations that aim at making ensemble methods more efficient. Second, ensemble methods were originally not introduced to explicitly handle and quantify uncertainties in neural networks. Although the derivation of uncertainty from ensemble predictions is obvious, since they actually aim at reducing the model uncertainty, ensembles were first introduced and discussed in order to improve the accuracy of a prediction [218]. Therefore, many works on ensemble methods do not explicitly take the uncertainty into account. Notwithstanding this, ensembles have been found to be well suited for uncertainty estimation in neural networks [31].
2) Single- and Multi-Mode Evaluation: One main point where ensemble methods differ from the other methods presented in this paper is the number of local optima that are considered, i.e. the differentiation into single-mode and multi-mode evaluation. In order to create synergies and marginalise false predictions of single members, the members of an ensemble have to behave differently in case of an uncertain outcome. The mapping defined by a neural network is highly non-linear, and hence the optimized loss function contains many local optima to which a training algorithm could converge. Deterministic neural networks converge to one single local optimum in the solution space [227]. Other approaches, e.g. BNNs, still converge to one single optimum, but additionally take the uncertainty on this local optimum into account [227]. This means that neighbouring points within a certain region around the solution also affect the loss and influence the prediction of a test sample. Since these methods focus on single regions, the evaluation is called single-mode evaluation. In contrast to this, ensemble methods consist of several networks, which should converge to different local optima. This leads to a so-called multi-mode evaluation [227].

Fig. 7: A visualization of the different evaluation behaviours of deterministic neural networks, Bayesian neural networks and the ensemble of deterministic neural networks. The x-axis indicates the network parameters θ and the y-axis represents the loss value. While the deterministic network learns the parameters based on a pointwise estimation, the Bayesian neural network also takes the surrounding of the single point into account. The ensemble of deterministic methods optimizes pointwise but learns several different parameter settings. (Panels: deterministic neural network, Bayesian neural network, ensemble of neural networks; rows: training, inference.)
In Figure 7, the considered parameters of a single-mode deterministic, single-mode probabilistic (Bayesian) and multi-mode ensemble approach are visualized. The goal of multi-mode evaluation is that different local optima could lead to models with different strengths and weaknesses in the predictions, such that a combination of several such models brings synergy effects improving the overall performance.
3) Bringing Variety into Ensembles: One of the most crucial points when applying ensemble methods is to maximize the variety in the behaviour among the single networks [228], [31]. In order to increase the variety, several different approaches can be applied: [...] on the performance of the already trained ensemble [62].
• Data Augmentation: Augmenting the input data randomly for each ensemble member leads to models trained on different data points and therefore in general to a larger variety among the different members.
• Ensemble of Different Network Architectures: The combination of different network architectures leads to different loss landscapes and can therefore also increase the diversity in the resulting predictions [229].
In several works, it has been shown that the variety induced by random initialization works sufficiently well and that bagging could even lead to a weaker performance [230], [31]. Livieris et al. [231] evaluated different bagging and boosting strategies for ensembles of weight-constrained neural networks. Interestingly, it is found that bagging performs better for a small number of ensemble members, while boosting performs better for a large number. Nanni et al. [232] evaluated ensembles based on different types of image augmentation for bioimage classification tasks and compared them to each other. Guo and Gould [233] used augmentation methods within an ensemble approach for object detection. Both works stated that the ensemble approach using augmentations improves the resulting accuracy. In contrast to this, [234], [235] stated with respect to uncertainty quantification that image augmentation can harm the calibration of an ensemble and that post-processing calibration methods have to be slightly adapted when using ensemble methods. Other ways of inducing variety for specific tasks have also been introduced. For instance, in [236], the members are trained with different attention masks in order to focus on different parts of the input data. Other approaches focused on the training process and introduced learning rate schedulers that are designed to discover several local optima within one training process [86], [237]. An ensemble can then be built from the local optima found within one single training run. It is important to note that, if not explicitly stated otherwise, the works and approaches presented so far targeted improvements in the predictive accuracy and did not explicitly consider uncertainty quantification.

4) Ensemble Methods and Uncertainty Quantification: Besides the improvement in accuracy, ensembles are widely used for modelling uncertainty on predictions of complex models, for example in climate prediction [238], [239]. Accordingly, ensembles are also used for quantifying the uncertainty on a deep neural network's prediction, and over the last years they have become more and more popular for such tasks [31], [228]. Lakshminarayanan et al. [31] are often referenced as a base work on uncertainty estimation derived from ensembles of neural networks and as a reference for the competitiveness of deep ensembles. They introduced an ensemble training pipeline to quantify predictive uncertainty within DNNs. In order to handle data and model uncertainty, the member networks are designed with two heads, representing the prediction and a predicted value of the data uncertainty on the prediction. The approach is evaluated with respect to accuracy, calibration, and out-of-distribution detection for classification and regression tasks. In all tests, the method performs at least as well as the BNN approaches used for comparison, namely Monte Carlo Dropout and Probabilistic Backpropagation. Lakshminarayanan et al. [31] also showed that shuffling the training data and randomly initializing the training process induce sufficient variety in the models to predict the uncertainty for the given architectures and data sets. Furthermore, bagging is even found to worsen the predictive uncertainty estimation, extending the findings of Lee et al. [230], who found bagging to worsen the predictive accuracy of ensemble methods on the investigated tasks. Gustafsson et al. [56] introduced a framework for the comparison of uncertainty quantification methods with a specific focus on real life applications. Based on this framework, they compared ensembles and Monte Carlo dropout and found ensembles to be more reliable and applicable to real life applications. These findings endorse the results reported by Beluch et al. [240], who found ensemble methods to deliver more accurate and better calibrated predictions on active learning tasks than Monte Carlo Dropout. Ovadia et al. [13] evaluated different uncertainty quantification methods based on test sets affected by distribution shifts. The extensive evaluation contains a variety of model types and data modalities. As a takeaway, the authors stated that already for a relatively small ensemble size of five, deep ensembles seem to perform best and are more robust to data set shifts than the compared methods. Vyas et al. [241] presented an ensemble method for the improved detection of out-of-distribution samples. For each member, a subset of the training data is considered as out-of-distribution. For the training process, a loss that seeks a minimum margin greater than zero between the average entropy of the in-domain and the out-of-distribution subsets is introduced, which leads to a significant improvement in out-of-distribution detection.
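For regression with two-headed members, each member outputs a mean and a variance, and the ensemble prediction can be formed by treating the members as an equally weighted mixture of Gaussians. The sketch below shows this combination rule, which is the one commonly associated with deep ensembles [31]; the surrounding text does not spell it out, so it should be read as an assumed illustration. The function name and the split of the total variance into a data part (average predicted variance) and a model part (variance of the predicted means) are illustrative choices.

```python
def combine_gaussian_members(means, variances):
    """Combine per-member Gaussian predictions into one mean and variance.

    means     -- predicted means mu_i of the M ensemble members
    variances -- predicted variances sigma_i^2 (the data uncertainty heads)
    Returns (mean, data_uncertainty, model_uncertainty, total_variance).
    """
    m = len(means)
    mean = sum(means) / m
    # Average of the predicted variances: uncertainty attributed to the data
    data_uncertainty = sum(variances) / m
    # Variance of the members' means: disagreement between the models
    model_uncertainty = sum(mu * mu for mu in means) / m - mean * mean
    # Variance of the equally weighted Gaussian mixture
    total_variance = data_uncertainty + model_uncertainty
    return mean, data_uncertainty, model_uncertainty, total_variance
```

For two members predicting means 1 and 3 with variances 0.5 and 1.5, the combined mean is 2, and the total variance of 2 splits evenly into a data part of 1 and a model part of 1.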

5) Making Ensemble Methods more Efficient: Compared to single model methods, ensemble methods come along with a significantly increased computational effort and memory consumption [217], [45]. When deploying an ensemble for a real life application, the available memory and computational power are often limited. Such limitations could easily become a bottleneck [242] and could become critical for applications with limited reaction time. Reducing the number of models leads to less memory and computational power consumption. Pruning approaches reduce the complexity of ensembles by pruning over the members and reducing the redundancy among them. For that, several approaches based on different diversity measures have been developed to remove single members without strongly affecting the performance [88], [87], [243].
Distillation is another approach, in which the number of networks is reduced to one single model. It is the procedure of teaching a single network to represent the knowledge of a group of neural networks [244]. First works on the distillation of neural networks were motivated by restrictions when deploying large scale classification problems [244]. The original classification problem is separated into several sub-problems focusing on single blocks of classes that are difficult to differentiate. Several smaller trainer networks are trained on the sub-problems and then teach one student network to separate all classes at the same time. In contrast to this, ensemble distillation approaches capture the behaviour of an ensemble by one single network. First works on ensemble distillation used the average of the softmax outputs of the ensemble members in order to teach a student network the derived predictive uncertainty [245]. Englesson and Azizpour [246] justify the resulting predictive distributions of this approach and additionally cover the handling of out-of-distribution samples. When averaging over the members' outputs, the model uncertainty, which is represented in the variety of ensemble outputs, gets lost. To overcome this drawback, researchers applied the idea of learning higher order distributions, i.e. distributions over a distribution, instead of directly predicting the output [90], [45]. The members are then distilled based on the divergence from the average distribution. The idea is closely related to the prior networks [32] and the evidential neural networks [44], which are described in Section III-A. [45] modelled ensemble members and the distilled network as prior networks predicting the parameters of a Dirichlet distribution. The distillation then seeks to minimize the KL divergence between the averaged Dirichlet distributions of the ensemble members and the output of the distilled network. Lindqvist et al. [90] generalized this idea to any other parameterizable distribution. With that, the method is also applicable to regression problems, for example by predicting a mean and standard deviation to describe a normal distribution. Within several tests, the distillation models generated by these approaches are able to distinguish between data uncertainty and model uncertainty. Although distillation methods cannot completely capture the behaviour of an underlying ensemble, it has been shown that they are capable of delivering good and for some experiments even comparable results [90], [45], [247]. Other approaches, such as sub-ensembles [39] and batch-ensembles [40], seek to reduce the computational effort and memory consumption by sharing parts among the single members. It is important to note that the possibility of using different model architectures for the ensemble members could get lost when parts of the ensembles are shared. Also, the training of the models cannot be run in a completely independent manner. Therefore, the actual time needed for training does not necessarily decrease in the same way as the computational effort does. Sub-ensembles [39] divide a neural network architecture into two sub-networks: the trunk network, which extracts general information from the input data, and the task network, which uses this information to fulfill the actual task. In order to train a sub-ensemble, first, the weights of each member's trunk network are fixed based on the resulting parameters of one single model's training process. Then, the parameters of each ensemble member's task network are trained independently from the other members. As a result, the members are built with a common trunk and an individual task sub-network. Since the training and the evaluation of the trunk network have to be done only once, the number of computations needed for training and testing decreases by the factor (M · N_task + N_trunk) / (M · N), where N_task, N_trunk, and N stand for the number of variables in the task networks, the trunk network, and the complete network. Valdenegro-Toro [39] further underlined the usage of a shared trunk network by arguing that the trunk network is in general computationally more costly than the task network. In contrast to this, batch-ensembles [40] connect the member networks with each other at every layer. The ensemble members' weights are described as a Hadamard product of one shared weight matrix W ∈ R^{n×m} and M individual rank-one matrices F_i ∈ R^{n×m}, each linked with one of the M ensemble members. The rank-one matrices can be written as a product F_i = r_i s_i^T of two vectors r_i ∈ R^n and s_i ∈ R^m, and hence the matrix F_i can be described by n + m parameters. With this approach, each additional ensemble member adds only n + m parameters per layer instead of the n · m parameters of a full weight matrix. On the one hand, with this approach, the members are not independent anymore, such that all the members have to be trained in parallel. On the other hand, the authors also showed that the parallelization can be realized similar to the optimization on mini-batches and on a single unit.
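The savings of both weight-sharing schemes can be checked with a short sketch. It assumes that r_i has length n and s_i has length m, so that the outer product r_i s_i^T matches the n × m shape of the shared matrix W, and it derives the sub-ensemble cost ratio directly from the description above (trunk evaluated once, task network evaluated M times). All function names are hypothetical.

```python
def batch_ensemble_weight(shared, r, s):
    """Member weight W_i = W * (r s^T), with * the element-wise product.

    shared -- shared weight matrix W as a list of n rows with m entries
    r, s   -- member-specific vectors of length n and m
    """
    return [
        [shared[j][k] * r[j] * s[k] for k in range(len(s))]
        for j in range(len(r))
    ]


def parameter_counts(n, m, members):
    """Parameters of one layer for independent members vs. a batch-ensemble."""
    independent = members * n * m
    batch_ensemble = n * m + members * (n + m)
    return independent, batch_ensemble


def sub_ensemble_cost_ratio(n_task, n_trunk, members):
    """Relative cost of a sub-ensemble compared to a full ensemble.

    The trunk is evaluated once and the task network once per member,
    while a full ensemble evaluates the complete network per member.
    """
    n_total = n_task + n_trunk
    return (members * n_task + n_trunk) / (members * n_total)
```

For a layer with n = 3, m = 4 and five members, independent members need 60 parameters while the batch-ensemble needs 12 + 5 · 7 = 47; the gap widens quickly for realistic layer sizes. Similarly, with N_task = 10, N_trunk = 90 and five members, a sub-ensemble requires only 28% of the computations of a full ensemble.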
6) Sum Up Ensemble Methods: Ensemble methods are very easy to apply, since no complex implementation or major modification of the standard deterministic model has to be realized. Furthermore, ensemble members are trained independently from each other, which makes the training easily parallelizable. Also, trained ensembles can be extended easily, but the needed memory and the computational effort increase linearly with the number of members for training and evaluation. The main challenge when working with ensemble methods is the need to introduce diversity among the ensemble members. For accuracy, uncertainty quantification, and out-of-distribution detection, random initialization, data shuffling, and augmentations have been found to be sufficient for many applications and tasks [31], [232]. Since these methods may be applied anyway, they do not need much additional effort. The independence of the single ensemble members leads to a linear increase in the required memory and computation power with each additional member. This holds for training as well as for testing. This limits the deployment of ensemble methods in many practical applications where the computation power or memory is limited, the application is time-critical, or very large networks with high inference time are included [45]. Many aspects of ensemble approaches have only been investigated with respect to predictive accuracy and do not take predictive uncertainty into account. This also holds for the comparison of different training strategies for a broad range of problems and data sets. Especially since the overconfidence of single members can be transferred to the whole ensemble, strategies that encourage the members to deliver different false predictions instead of all delivering the same false prediction should be further investigated. For a better understanding of ensemble behavior, further evaluations of the loss landscape, as done by Fort et al. [227], could offer interesting insights.

D. Test Time Augmentation
Inspired by ensemble methods and adversarial examples [14], test time data augmentation is one of the simpler predictive uncertainty estimation techniques. The basic method is to create multiple test samples from each test sample by applying data augmentation techniques to it and then to test all those samples to compute a predictive distribution in order to measure uncertainty. The idea behind this method is that the augmented test samples allow the exploration of different views and are therefore capable of capturing the uncertainty. So far, test time data augmentation has mostly been used in medical image processing [248], [249], [14], [250]. One of the reasons for this is that the field of medical image processing already makes heavy use of data augmentation when using deep learning [251], so it is quite easy to apply those same augmentations during test time to calculate the uncertainties. Another reason is that collecting medical images is costly, thus forcing practitioners to rely on data augmentation techniques. Moshkov et al. [250] used the test time augmentation technique for cell segmentation tasks. For that, they created multiple variations of the test data before feeding it to a trained UNet or Mask R-CNN architecture. They then used majority voting to create the final output segmentation mask and discussed the policies of applying different augmentation techniques and how they affect the final predictive results of the deep networks. Overall, test time augmentation is an easy method for estimating uncertainties because it keeps the underlying model unchanged, requires no additional data, and is simple to put into practice with off-the-shelf libraries. Nonetheless, it needs to be kept in mind that when applying this technique, one should only apply valid augmentations to the data, meaning that the augmentations should not generate data from outside the target distribution. According to [252], test time augmentation can change many correct predictions into incorrect predictions (and vice versa) due to many factors such as the nature of the problem at hand, the size of the training data, the deep neural network architecture, and the type of augmentation. To limit the impact of these factors, Shanmugam et al. [252] proposed a learning-based method for test time augmentation that takes these factors into consideration. In particular, the proposed method learns a function that aggregates the predictions from each augmentation of a test sample. Similar to [252], Molchanov et al. [91] proposed a method, named "greedy Policy Search", for constructing a test-time augmentation policy by choosing augmentations to be included in a fixed-length policy. Similarly, Kim et al. [253] proposed a method for learning a loss predictor from the training data for instance-aware test-time augmentation selection. The predictor selects the test-time augmentations with the lowest predicted loss for a given sample.
Although learnable test time augmentation techniques [252], [91], [253] help to select valid augmentations, one major open question is how different kinds of augmentations affect the estimated uncertainty. It can for example happen that a simple augmentation like reflection is not able to capture much of the uncertainty, while some domain-specific stretching and shearing captures more uncertainty. It is also important to find out how many augmentations are needed to correctly quantify uncertainties in a given task. This is particularly important in applications like earth observation, where inference might be needed on a global scale with limited resources.
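The basic test time augmentation procedure can be sketched as follows. The example assumes a classifier that returns class probabilities and aggregates the augmented predictions by simple averaging (rather than the majority voting or learned aggregation used in some of the works above); the predictive entropy of the averaged distribution then serves as the uncertainty measure. The toy classifier `toy_model` and the two augmentations are hypothetical stand-ins for a trained network and domain-valid augmentations.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def tta_predict(model, x, augmentations):
    """Average the class probabilities over all augmented copies of x."""
    all_probs = [model(augment(x)) for augment in augmentations]
    n_views = len(all_probs)
    n_classes = len(all_probs[0])
    mean_probs = [
        sum(probs[k] for probs in all_probs) / n_views for k in range(n_classes)
    ]
    return mean_probs, entropy(mean_probs)


def toy_model(x):
    """Hypothetical two-class classifier: compares the two halves of x."""
    half = len(x) // 2
    return softmax([sum(x[:half]), sum(x[half:])])


def identity(x):
    return list(x)


def flip(x):
    return list(reversed(x))
```

Calling `tta_predict(toy_model, [1.0, 0.0, 0.0, 0.0], [identity, flip])` yields averaged probabilities of 0.5 for each class and the maximum entropy log 2, because the toy classifier changes its decision under the flip. For a model that is invariant to the chosen augmentations, the entropy would equal that of the single unaugmented prediction, which illustrates why the choice of augmentations determines how much uncertainty is captured.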

E. Neural Network Uncertainty Quantification Approaches for Real Life Applications
In order to use the presented methods on real life tasks, several different considerations have to be taken into account. The memory and computational power are often restricted, while many real world tasks may be time-critical [242]. An overview of the main properties is given in Table I. The presented approaches all come along with advantages and disadvantages, depending on the properties a user is interested in. While ensemble methods and test-time augmentation methods are relatively easy to apply, Bayesian approaches deliver a clear description of the uncertainty on the model's parameters and also a deeper theoretical basis. The computational effort and memory consumption are a common restriction in real life applications, where single deterministic network approaches perform best, but distillation of ensembles or efficient Bayesian methods can also be taken into consideration. Within the different types of Bayesian approaches, the performance, the computational effort, and the implementation effort still vary strongly. Laplace approximations are relatively easy to apply, and compared to sampling approaches much less computational effort is needed. Furthermore, pre-trained networks often already exist for an application. In this case, the Laplace Approximation and external deterministic single network approaches can in general be applied to already trained networks.
Another important aspect that has to be taken into account for uncertainty quantification in real life applications is the source and type of uncertainty. For real life applications, out-of-distribution detection is perhaps the most important challenge in order to avoid unexpected decisions of the network and to be aware of adversarial attacks. Especially since many motivations of uncertainty quantification are given by risk minimization, methods that deliver risk-averse predictions are an important field to evaluate. Many works have already demonstrated the capability of detecting out-of-distribution samples on several tasks and built a strong fundamental tool set for the deployment in real life applications [254], [241], [255], [56]. However, in real life, the tasks are much more difficult than finding out-of-distribution samples among data sets (e.g., the MNIST or CIFAR data sets), and the main challenge lies in comparing such approaches on several real-world data sets against each other. The work of Gustafsson et al. [56] forms a first important step towards an evaluation of methods that better suits the demands in real life applications. Interestingly, they found that in their tests ensembles outperform the considered Bayesian approaches. This indicates that the multi-mode evaluation given by ensembles is a powerful property for real life applications. Nevertheless, Bayesian approaches have delivered strong results as well and furthermore come along with a strong theoretical foundation [84], [211], [6], [23]. As a way forward, the combination of efficient ensemble strategies and Bayesian approaches could capture the variability in the model parameters while still considering several modes for a prediction. Also, single deterministic approaches such as the prior networks [32], [64], [44], [33] deliver comparable results while consuming significantly less computation power. However, this efficiency often comes along with the problem that separate sets of in- and out-of-distribution samples have to be available for the training process [33], [64]. In general, the development of new problem and loss formulations, as for example given in [64], leads to a better understanding and description of the underlying problem and forms an important field of research.

IV. UNCERTAINTY MEASURES AND QUALITY
In Section III, we presented different methods for modeling and predicting different types of uncertainty in neural networks. In order to evaluate these approaches, measures have to be applied to the derived uncertainties. In the following, we present different measures for quantifying the different predicted types of uncertainty. In general, the correctness and trustworthiness of these uncertainties is not automatically given. In fact, there are several reasons why evaluating the quality of the uncertainty estimates is a challenging task.
• First, the quality of the uncertainty estimation depends on the underlying method for estimating uncertainty. This is exemplified in the work undertaken by Yao et al. [256], which shows that different approximations of Bayesian inference (e.g., Gaussian and Laplace approximations) result in different qualities of uncertainty estimates.
• Second, there is a lack of ground truth uncertainty estimates [31], and defining ground truth uncertainty estimates is challenging. For instance, if we define the ground truth uncertainty as the uncertainty across human subjects, we still have to answer questions such as "How many subjects do we need?" or "How to choose the subjects?".
• Third, there is a lack of a unified quantitative evaluation metric [257]. To be more specific, uncertainty is defined differently in different machine learning tasks such as classification, segmentation, and regression. For instance, prediction intervals or standard deviations are used to represent uncertainty in regression tasks, while entropy (and other related measures) is used to capture uncertainty in classification and segmentation tasks.

A. Evaluating Uncertainty in Classification Tasks
For classification tasks, the network's softmax output already represents a measure of confidence. But since the raw softmax output is neither very reliable [67] nor able to represent all sources of uncertainty [19], further approaches and corresponding measures were developed.
1) Measuring Data Uncertainty in Classification Tasks: Consider a classification task with K different classes and a probability vector network output p(x) for some input sample x. In the following, p is used for simplification and p_k stands for the k-th entry in the vector. In general, the given prediction p represents a categorical distribution, i.e. it assigns to each class a probability of being the correct prediction. Since the prediction is not given as an explicit class but as a probability distribution, (un)certainty estimates can be directly derived from the prediction. In general, this pointwise prediction can be seen as estimated data uncertainty [60]. However, as discussed in Section II, the model's estimation of the data uncertainty is affected by model uncertainty, which has to be taken into account separately. In order to evaluate the amount of predicted data uncertainty, one can for example apply the maximal class probability or the entropy measures:
$$p_{\max} = \max\{p_k\}_{k=1}^{K} \quad \text{(maximal probability)},$$
$$H(p) = -\sum_{k=1}^{K} p_k \log_2(p_k) \quad \text{(entropy)}.$$
The maximal probability is a direct representation of certainty, while entropy describes the average level of information in a random variable. Even though a softmax output should represent the data uncertainty, one cannot tell from a single prediction how large the amount of model uncertainty is that affects this specific prediction as well.
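The two data-uncertainty measures named above, maximal class probability and entropy, can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the function names, the base-2 logarithm, and the small constant `eps` that guards against log(0) are assumptions made for the example.

```python
import numpy as np

def max_probability(p):
    """Maximal class probability of a categorical prediction p (length K)."""
    return float(np.max(p))

def entropy(p, eps=1e-12):
    """Shannon entropy of p in bits; eps (an assumption) avoids log2(0)."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log2(p + eps)))

# A uniform prediction is maximally uncertain, a peaked one is not.
uniform = np.full(4, 0.25)
peaked = np.array([0.97, 0.01, 0.01, 0.01])
print(max_probability(uniform), entropy(uniform))
print(max_probability(peaked), entropy(peaked))
```

Both values are computed from a single softmax vector, so they cannot reveal how much model uncertainty is hidden in that vector.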
2) Measuring Model Uncertainty in Classification Tasks: As already discussed in Section III, a single softmax prediction is not a very reliable way of quantifying uncertainty, since it is often badly calibrated [19] and does not carry any information about the certainty the model itself has in this specific output [19]. An (approximated) posterior distribution p(θ|D) on the learned model parameters can help to obtain better uncertainty estimates. With such a posterior distribution, the softmax output itself becomes a random variable and one can evaluate its variation, i.e. uncertainty. For simplicity, we denote p(y|θ, x) also as p, and it will be clear from context whether p depends on θ or not. The most common measures for this are the mutual information (MI), the expected Kullback-Leibler divergence (EKL), and the predictive variance. Basically, all these measures compute the expected divergence between the (stochastic) softmax output and the expected softmax output
$$\hat{p} = \mathbb{E}_{\theta \sim p(\theta|D)}\left[p(y|\theta, x)\right].$$
The MI uses the entropy to measure the mutual dependence between two variables. In the described case, the difference between the information given in the expected softmax output and the expected information in the softmax output is compared, i.e.
$$\mathrm{MI}(\theta, y \,|\, x, D) = H\left[\hat{p}\right] - \mathbb{E}_{\theta \sim p(\theta|D)}\, H\left[p(y|\theta, x)\right]. \tag{31}$$
Smith and Gal [19] pointed out that the MI is minimal when the knowledge about model parameters does not increase the information in the final prediction. Therefore, the MI can be interpreted as a measure of model uncertainty.
The Kullback-Leibler divergence measures the divergence between two given probability distributions. The EKL can be used to measure the (expected) divergence among the possible softmax outputs,
$$\mathbb{E}_{\theta \sim p(\theta|D)}\left[\mathrm{KL}(\hat{p} \,\|\, p)\right] = \mathbb{E}_{\theta \sim p(\theta|D)}\left[\sum_{k=1}^{K} \hat{p}_k \log\frac{\hat{p}_k}{p_k}\right], \tag{32}$$
which can also be interpreted as a measure of uncertainty on the model's output and therefore represents the model uncertainty.
The predictive variance evaluates the variance of the (random) softmax outputs, i.e.
$$\sigma(p) = \mathbb{E}_{\theta \sim p(\theta|D)}\left[(p - \hat{p})^2\right]. \tag{33}$$
As described in Section III, an analytically described posterior distribution p(θ|D) is only given for a subset of the Bayesian methods. And even for an analytically described distribution, the propagation of the parameter uncertainty into the prediction is in almost all cases intractable and has to be approximated, for example with Monte Carlo approximation. Similarly, ensemble methods collect predictions from M neural networks, and test-time data augmentation approaches obtain M predictions from M different augmentations applied to the original input sample. For all these cases, we obtain a set of M samples, $\{p^i\}_{i=1}^{M}$, which can be used to approximate the intractable or even undefined underlying distribution. With these approximations, the measures defined in (31), (32), and (33) can be applied straightforwardly and only the expectations have to be replaced by averages. For example, the expected softmax output becomes
$$\hat{p} \approx \frac{1}{M}\sum_{i=1}^{M} p^i.$$
The expectations in (31), (32), and (33) are approximated similarly.
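The sample-based approximation described above can be sketched as follows, assuming the M softmax outputs for one input are stacked into an array of shape [M, K]. The function names, the natural logarithm, and the constant `eps` are illustrative assumptions; the MI, EKL and predictive variance are written in their usual textbook form with each expectation replaced by a mean over the M samples.

```python
import numpy as np

def _entropy(p, eps=1e-12):
    """Entropy (in nats) along the last axis; eps guards against log(0)."""
    return -np.sum(p * np.log(p + eps), axis=-1)

def model_uncertainty_measures(samples, eps=1e-12):
    """samples: array [M, K] holding M softmax outputs for a single input."""
    samples = np.asarray(samples, dtype=float)
    p_hat = samples.mean(axis=0)  # Monte Carlo estimate of the expected softmax
    # MI: entropy of the mean prediction minus the mean entropy of the samples.
    mi = _entropy(p_hat) - _entropy(samples).mean()
    # EKL: mean KL divergence from the mean prediction to each sample.
    kl = np.sum(p_hat * (np.log(p_hat + eps) - np.log(samples + eps)), axis=-1)
    ekl = kl.mean()
    # Predictive variance, kept per class.
    var = np.mean((samples - p_hat) ** 2, axis=0)
    return p_hat, float(mi), float(ekl), var

agree = np.array([[0.7, 0.3], [0.7, 0.3], [0.7, 0.3]])
disagree = np.array([[0.9, 0.1], [0.1, 0.9]])
print(model_uncertainty_measures(agree)[1:3])
print(model_uncertainty_measures(disagree)[1:3])
```

When all samples agree, the three measures vanish even if the shared prediction is wrong, which is exactly the limitation that motivates the distributional measures discussed next.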
3) Measuring Distributional Uncertainty in Classification Tasks: Although these uncertainty measures are widely used to capture the variability among several predictions derived from Bayesian neural networks [60], ensemble methods [31], or test-time data augmentation methods [14], they cannot capture distributional shifts in the input data or out-of-distribution examples, which could lead to a biased inference process and a falsely stated confidence. If all predictors attribute a high probability mass to the same (false) class label, this induces a low variability among the estimates. Hence, the network seems to be certain about its prediction, while the uncertainty in the prediction itself (given by the softmax probabilities) is also evaluated to be low. To tackle this issue, several approaches described in Section III take the magnitude of the logits into account, since a larger logit indicates larger evidence for the corresponding class [44]. Thus, the methods either interpret the total sum of the (exponentials of the) logits as the precision value of a Dirichlet distribution (see the description of Dirichlet priors in Section III-A) [32], [94], [64], or as a collection of evidence that is compared to a defined constant [44], [92]. One can also derive a total class probability for each class individually by applying the sigmoid function to each logit [105]. Based on the class-wise total probabilities, OOD samples might be detected more easily, since all classes can have low probability at the same time. Other methods deliver an explicit measure of how well new data samples fit into the training data distribution. Based on this, they also give a measure of how likely it is that a sample will be predicted correctly [36].
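The role of the logit magnitude can be made concrete with a small sketch. It is only an illustration under simplifying assumptions: the helper names are hypothetical, the Dirichlet precision is taken as the plain sum of exponentiated logits, and the class-wise probability is the plain sigmoid of each logit, without the training procedures the cited methods rely on.

```python
import numpy as np

def softmax(logits):
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())  # shifting by the maximum does not change the result
    return e / e.sum()

def dirichlet_precision(logits):
    """Sum of exponentiated logits, read as total evidence (Dirichlet precision)."""
    return float(np.sum(np.exp(np.asarray(logits, dtype=float))))

def classwise_sigmoid(logits):
    """Independent per-class probabilities; they need not sum to one."""
    return 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))

# Two logit vectors that differ only by a constant shift.
strong = np.array([5.0, 0.0, 0.0])
weak = strong - 8.0
print(softmax(strong), softmax(weak))            # identical softmax outputs
print(dirichlet_precision(strong), dirichlet_precision(weak))
print(classwise_sigmoid(strong), classwise_sigmoid(weak))
```

The softmax is invariant to the shift and reports the same confidence for both inputs, whereas the precision and the class-wise sigmoid outputs drop sharply for the low-evidence vector.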
4) Performance Measure on Complete Data Set: While the measures described above assess individual predictions, others evaluate the usage of these measures on a set of samples. Measures of uncertainty can be used to separate correctly from falsely classified samples or in-domain from out-of-distribution samples [67]. For that, the samples are split into two sets, for example in-domain and out-of-distribution or correctly classified and falsely classified. The two most common approaches are the Receiver Operating Characteristic (ROC) curve and the Precision-Recall (PR) curve. Both methods generate curves based on different thresholds of the underlying measure. For each considered threshold, the ROC curve plots the true positive rate against the false positive rate, and the PR curve plots the precision against the recall. While the ROC and PR curves give a visual idea of how well the underlying measures are suited to separate the two considered test cases, they do not give a quantitative measure. To obtain one, the area under the curve (AUC) can be evaluated. Roughly speaking, the AUC gives the probability that a randomly chosen positive sample leads to a higher measure than a randomly chosen negative sample. For example, the maximum softmax value ranks correctly classified examples higher than falsely classified examples. Hendrycks and Gimpel [67] showed for several application fields that correct predictions have in general a higher predicted certainty in the softmax value than false predictions. Especially for the evaluation of in-domain and out-of-distribution examples, the Area Under the Receiver Operating Characteristic Curve (AUROC) and the Area Under the Precision-Recall Curve (AUPRC) are commonly used [64], [32], [94]. The clear weakness of these evaluations is the fact that the performance is evaluated and the optimal threshold is computed based on a given test data set. A distribution shift away from the test set distribution can ruin the whole performance and make the derived thresholds impractical.
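The probabilistic reading of the AUC for the ROC curve can be computed directly by comparing every positive score with every negative score. The sketch below assumes that higher scores indicate the positive set and counts ties as one half; the function name and the toy scores are illustrative, and the pairwise comparison is quadratic in the number of samples, so it is meant for small examples only.

```python
import numpy as np

def auroc(scores_pos, scores_neg):
    """Probability that a random positive sample scores higher than a random
    negative sample, with ties counted as 1/2 (equals the area under the ROC curve)."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

# Toy example: maximum softmax values of correctly and falsely classified samples.
conf_correct = [0.90, 0.80, 0.95]
conf_wrong = [0.60, 0.85]
print(auroc(conf_correct, conf_wrong))
```

A value of 0.5 corresponds to a measure that cannot separate the two sets, and a value of 1.0 to perfect separation.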

B. Evaluating Uncertainty in Regression Tasks
1) Measuring Data Uncertainty in Regression Predictions:
In contrast to classification tasks, where the network typically outputs a probability distribution over the possible classes, regression tasks only predict a pointwise estimate without any hint of data uncertainty. As already described in Section III, a common approach to overcome this is to let the network predict the parameters of a probability distribution, for example a mean vector and a standard deviation for a normally distributed uncertainty [31], [60]. Doing so, a measure of data uncertainty is directly given. The prediction of the standard deviation allows an analytical description of the region in which the (unknown) true value lies. For a predicted mean ŷ and standard deviation σ, the central interval that covers the true value with a probability of α (under the assumption that the predicted distribution is correct) is given by
$$\left[\hat{y} - \Phi^{-1}\!\left(\tfrac{1+\alpha}{2}\right)\cdot\sigma;\ \ \hat{y} + \Phi^{-1}\!\left(\tfrac{1+\alpha}{2}\right)\cdot\sigma\right],$$
where Φ^{-1} is the quantile function, the inverse of the cumulative distribution function of the standard normal distribution. For a given probability value α, the quantile function gives a boundary such that 100 · α% of a standard normal distribution's probability mass lies on values smaller than Φ^{-1}(α). Quantiles assume some probability distribution and interpret the given prediction as the expected value of the distribution.
In contrast to this, other approaches [259], [260] directly predict a so-called prediction interval (PI) in which the prediction is assumed to lie. Such intervals induce an uncertainty in the form of a uniform distribution without giving a concrete prediction. The certainty of such approaches can, as the name indicates, be directly measured by the size of the predicted interval. The Mean Prediction Interval Width (MPIW) can be used to evaluate the average certainty of the model [259], [260]. In order to evaluate the correctness of the predicted intervals, the Prediction Interval Coverage Probability (PICP) can be applied [259], [260]. The PICP represents the percentage of ground truth values that fall into the predicted intervals and is defined as
$$\mathrm{PICP} = \frac{c}{n},$$
where n is the total number of predictions and c the number of ground truth values that are actually captured by the predicted intervals.
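The interval-based measures for regression can be sketched as follows. The sketch assumes a Gaussian predictive distribution and uses the standard two-sided central interval ŷ ± Φ^{-1}((1+α)/2)·σ; the quantile function is taken from Python's `statistics.NormalDist`, and the function names are illustrative.

```python
from statistics import NormalDist
import numpy as np

def gaussian_interval(mean, std, alpha):
    """Central interval that covers probability alpha under N(mean, std^2)."""
    z = NormalDist().inv_cdf((1.0 + alpha) / 2.0)  # standard normal quantile
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    return mean - z * std, mean + z * std

def picp(y_true, lower, upper):
    """Prediction Interval Coverage Probability: share of targets inside the interval (c / n)."""
    y_true = np.asarray(y_true, dtype=float)
    inside = (y_true >= np.asarray(lower)) & (y_true <= np.asarray(upper))
    return float(np.mean(inside))

def mpiw(lower, upper):
    """Mean Prediction Interval Width."""
    return float(np.mean(np.asarray(upper, dtype=float) - np.asarray(lower, dtype=float)))

# Hypothetical predictions (mean, std) and targets for three test samples.
mu = np.array([0.0, 1.0, 2.0])
sigma = np.array([1.0, 0.5, 0.1])
y = np.array([0.3, 2.5, 2.05])
lo, hi = gaussian_interval(mu, sigma, alpha=0.95)
print(picp(y, lo, hi), mpiw(lo, hi))
```

PICP and MPIW are usually reported together, since a model can trivially reach high coverage by predicting very wide intervals.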

2) Measuring Model Uncertainty in Regression Predictions:
As described in Section II, model uncertainty is mainly caused by the model's architecture, the training process, and underrepresented areas in the training data. Hence, there is no real difference in the causes and effects of model uncertainty between regression and classification tasks, so that model uncertainty in regression tasks can be measured in the same way as already described for classification tasks, i.e. in most cases by approximating an average prediction and measuring the divergence among the single predictions [60].

C. Evaluating Uncertainty in Segmentation Tasks
In the context of segmentation, the uncertainty in pixel-wise segmentation is measured using confidence intervals [4], [152], the predictive variance [262], [264], the predictive entropy [2], [249], [261], [263] or the mutual information [1]. The uncertainty in structure (volume) estimation is obtained by averaging over all pixel-wise uncertainty estimates [264], [261]. The quality of volume uncertainties is assessed by evaluating the coefficient of variation, the average Dice score or the intersection over union [2], [249]. These metrics measure the agreement in area overlap between multiple estimates in a pairwise fashion. Ideally, a false segmentation should result in an increase in pixel-wise and structure uncertainty. To evaluate whether this is the case, Nair et al. [1] evaluated the pixel-level true positive rate and false detection rate as well as the ROC curves for the retained pixels at different uncertainty thresholds. Similar to [1], McClure et al. [261] also analyzed the area under the ROC curve.
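The pairwise overlap agreement between several segmentation estimates can be sketched as below. The helper names are hypothetical, the masks are assumed to be binary arrays of equal shape, and two empty masks are treated as perfectly agreeing, which is a convention chosen for this example.

```python
from itertools import combinations
import numpy as np

def dice(a, b):
    """Dice score between two binary masks."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    return 1.0 if denom == 0 else float(2.0 * np.logical_and(a, b).sum() / denom)

def iou(a, b):
    """Intersection over union between two binary masks."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return 1.0 if union == 0 else float(np.logical_and(a, b).sum() / union)

def mean_pairwise(metric, masks):
    """Average of a pairwise overlap metric over all pairs of sampled masks."""
    return float(np.mean([metric(a, b) for a, b in combinations(masks, 2)]))

# Three hypothetical segmentations of the same 2x2 image.
masks = [np.array([[1, 1], [0, 0]]),
         np.array([[1, 0], [0, 0]]),
         np.array([[1, 1], [0, 0]])]
print(mean_pairwise(dice, masks), mean_pairwise(iou, masks))
```

A low mean pairwise overlap indicates that the sampled segmentations disagree on the structure, i.e. a high structure-level uncertainty.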

V. CALIBRATION
A predictor is called well-calibrated if the derived predictive confidence represents a good approximation of the actual probability of correctness [15]. Therefore, in order to make use of uncertainty quantification methods, one has to be sure that the network is well calibrated. Formally, for classification tasks a neural network f_θ is calibrated [265] if it holds that
$$\forall p \in [0, 1]: \quad \frac{\sum_{i=1}^{N}\sum_{k=1}^{K} y_{i,k}\cdot \mathbb{I}\{f_\theta(x_i)_k = p\}}{\sum_{i=1}^{N}\sum_{k=1}^{K} \mathbb{I}\{f_\theta(x_i)_k = p\}} \xrightarrow{N \to \infty} p. \tag{37}$$
Here, I{·} is the indicator function that is either 1 if the condition is true or 0 if it is false, and y_{i,k} is the k-th entry in the one-hot encoded ground truth vector of a training sample (x_i, y_i). This formulation means that, for example, 30% of all predictions with a predictive confidence of 70% should actually be false. For regression tasks, the calibration can be defined such that predicted confidence intervals should match the confidence intervals empirically computed from the data set [265], i.e.
$$\forall p \in [0, 1]: \quad \sum_{i=1}^{N} \frac{\mathbb{I}\{y_i \in \mathrm{conf}_p(f_\theta(x_i))\}}{N} \xrightarrow{N \to \infty} p, \tag{38}$$
where conf_p is the confidence interval that covers p percent of a distribution. A DNN is called under-confident if the left-hand sides of (37) and (38) are larger than p. Conversely, it is called over-confident if the terms are smaller than p. The calibration property of a DNN can be visualized using a reliability diagram, as shown in Figure 8.
In general, calibration errors are caused by factors related to model uncertainty [15]. This is intuitively clear since, as discussed in Section II, data uncertainty represents the underlying uncertainty that an input x and a target y represent the same real-world information. Consequently, correctly predicted data uncertainty would lead to a perfectly calibrated neural network. In practice, several works pointed out that deeper networks tend to be more overconfident than shallower ones [15], [266], [267].
Several methods for uncertainty estimation presented in Section III also improve the network's calibration [31], [20]. This is clear, since these methods quantify model and data uncertainty separately and aim at reducing the model uncertainty on the predictions. Besides the methods that improve the calibration by reducing the model uncertainty, a large and growing body of literature has investigated methods for explicitly reducing calibration errors. These methods are presented in the following, followed by measures to quantify the calibration error. It is important to note that these methods do not reduce the model uncertainty, but propagate the model uncertainty onto the representation of the data uncertainty. For example, if a binary classifier is overfitted and predicts all samples of a test set as class A with probability 1, while half of the test samples actually belong to class B, the recalibration methods might map the network output to 0.5 in order to obtain a reliable confidence. This probability of 0.5 is not equivalent to the data uncertainty but represents the model uncertainty propagated onto the predicted data uncertainty.

A. Calibration Methods
Calibration methods can be classified into three main groups according to the step at which they are applied:
• Regularization methods applied during the training phase [268], [269], [11], [270], [271]: These methods modify the objective, optimization and/or regularization procedure in order to build DNNs that are inherently calibrated.
• Post-processing methods applied after the training process of the DNN [15], [47]: These methods require a held-out calibration data set to adjust the prediction scores for recalibration. They only work under the assumption that the distribution of the held-out validation set is equivalent to the distribution on which inference is done. Hence, the size of the validation data set can also influence the calibration result.
• Neural network uncertainty estimation methods: Approaches, as presented in Section III, that reduce the amount of model uncertainty on a neural network's confidence prediction also lead to a better calibrated predictor. This is because the remaining predicted data uncertainty better represents the actual uncertainty on the prediction. Such methods are based, for example, on Bayesian methods [272], [209], [273], [274], [16] or deep ensembles [31], [275].
In the following, we present the three types of calibration methods in more detail.
1) Regularization Methods: Regularization methods for calibrating confidences manipulate the training of DNNs by modifying the objective function or by augmenting the training data set. The goal and idea of regularization methods is very similar to the methods presented in Section III-A, where the methods mainly quantify model and data uncertainty separately within a single forward pass. However, the methods in Section III-A quantify the model and data uncertainty, while these calibration methods are regularized in order to minimize the model uncertainty. Consequently, at inference, the model uncertainty cannot be obtained anymore. This is the main motivation for us to separate the approaches presented below from the approaches presented in Section III-A. One popular regularization-based calibration method is label smoothing [268]. For label smoothing, the labels of the training examples are modified by taking a small portion α of the true class' probability mass and assigning it uniformly to the false classes. For hard, non-smoothed labels, the optimum cannot be reached in practice, as the gradient of the cross-entropy loss with respect to the logit vector z,
$$\nabla_z \mathrm{CE}(y, \hat{y}(z)) = \mathrm{softmax}(z) - y,$$
can only converge to zero with increasing distance between the true and false classes' logits. As a result, the logits of the correct class are much larger than the logits of the incorrect classes, and the logits of the incorrect classes can be very different from each other. Label smoothing avoids this, and while it generally leads to a higher training loss, the calibration error decreases and the accuracy often increases as well [270]. Seo et al. [266] extended the idea of label smoothing and directly aimed at reducing the model uncertainty. For this, they sampled T forward passes of a stochastic neural network already at training time. Based on the T forward passes of a training sample (x_i, y_i), a normalized model variance α_i is derived from the mean of the Bhattacharyya coefficients [281] between the T individual predictions ŷ_1, ..., ŷ_T and the average prediction $\bar{y} = \frac{1}{T}\sum_{t=1}^{T}\hat{y}_t$,
$$\alpha_i = 1 - \frac{1}{T}\sum_{t=1}^{T} \mathrm{BC}(\bar{y}, \hat{y}_t) = 1 - \frac{1}{T}\sum_{t=1}^{T}\sum_{k=1}^{K}\sqrt{\bar{y}_k \cdot \hat{y}_{t,k}}.$$
Based on this α_i, Seo et al. [266] introduced the variance-weighted confidence-integrated (VWCI) loss function, which is a convex combination of two opposing loss functions,
$$L_{\mathrm{VWCI}}(\theta) = \sum_{i=1}^{N} (1 - \alpha_i)\, L_{\mathrm{GT}}^{(i)}(\theta) + \alpha_i\, L_{\mathrm{U}}^{(i)}(\theta),$$
where $L_{\mathrm{GT}}^{(i)}$ is the mean cross-entropy computed for the training sample x_i with given ground truth y_i, and $L_{\mathrm{U}}^{(i)}$ represents the mean KL-divergence between a uniform target probability vector and the computed prediction. The adaptive smoothing parameter α_i pushes predictions of training samples with high model uncertainty (given by high variances) towards a uniform distribution while increasing the prediction scores of samples with low model uncertainty. As a result, variances in the predictions of a single sample are reduced and the network can then be applied with a single forward pass at inference. Pereyra et al. [269] combated the overconfidence issue by adding the negative entropy to the standard loss function and therefore a penalty that increases with the network's predicted confidence. This results in the entropy-based objective function L_H, which is defined as
$$L_H(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_i^{\top} \log \hat{y}_i + \alpha_i H(\hat{y}_i) \right],$$
where H(ŷ_i) is the entropy of the output and α_i a parameter that controls the strength of the entropy-based confidence penalty. The parameter α_i is computed in the same way as for the VWCI loss. Instead of regularizing the training process by modifying the objective function, Thulasidasan et al. 
[278] regularized it by using a data-agnostic data augmentation technique named mixup [282]. In mixup training, the network is not only trained on the training data, but also on virtual training samples (x̃, ỹ) generated by a convex combination of two random training pairs (x_i, y_i) and (x_j, y_j), i.e.
$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j,$$
where λ ∈ [0, 1] is the mixing coefficient. Maroñas et al. [279] see mixup training among the most popular data augmentation regularization techniques due to its ability to improve the calibration as well as the accuracy. However, they argued that in mixup training the data uncertainty in mixed inputs affects the calibration and therefore mixup does not necessarily improve the calibration. They also supported this claim empirically. Similarly, Rahaman and Thiery [234] experimentally showed that the distributional shift induced by data augmentation techniques such as mixup training can negatively affect the confidence calibration. Based on this observation, Maroñas et al. [279] proposed a new objective function that explicitly takes the calibration performance on the unmixed input samples into account. Inspired by the expected calibration error (ECE, see Section V-B) of Naeini et al. [283], they measured the calibration performance on the unmixed samples for each batch b by the differentiable squared difference between the batch accuracy and the mean confidence on the batch samples. The total loss is given as a weighted combination of the original loss on mixed and unmixed samples and the calibration measure evaluated only on the unmixed samples. Hendrycks et al. [277] (see also [280]) showed that exposing classifiers to out-of-distribution examples at training time can help to improve the calibration.
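Two of the regularization techniques described above, label smoothing and mixup, only change the training targets and inputs and can therefore be sketched independently of any network. The sketch follows the textual description: label smoothing moves a portion α of the true class' mass uniformly to the false classes, and mixup forms convex combinations of random training pairs. The function names are illustrative, and drawing λ from a Beta distribution with a parameter of 0.2 is an assumption borrowed from common mixup practice, not something stated in this section.

```python
import numpy as np

def smooth_labels(y_onehot, alpha):
    """Move a portion alpha of the true class' mass uniformly to the false classes."""
    y = np.asarray(y_onehot, dtype=float)
    k = y.shape[-1]
    return y * (1.0 - alpha) + (1.0 - y) * alpha / (k - 1)

def mixup_batch(x, y, rng, beta_param=0.2):
    """Convex combination of each sample with a randomly chosen partner sample.

    lam is drawn once per batch from Beta(beta_param, beta_param) (assumption).
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    lam = rng.beta(beta_param, beta_param)
    perm = rng.permutation(len(x))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return x_mix, y_mix, lam

rng = np.random.default_rng(0)
x = np.arange(8.0).reshape(4, 2)   # four toy inputs with two features each
y = np.eye(4)                      # one-hot labels for four classes
print(smooth_labels(y[:1], alpha=0.1))
x_mix, y_mix, lam = mixup_batch(x, y, rng)
print(lam, y_mix.sum(axis=1))
```

In both cases the modified targets remain valid probability vectors, so the usual cross-entropy loss can be applied without further changes.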
2) Post-Processing Methods: Post-processing (or post-hoc) methods are applied after the training process and aim at learning a re-calibration function. For this, a subset of the training data is held out during the training process and used as a calibration set. The re-calibration function is applied to the network's outputs (e.g. the logit vector) and yields an improved calibration learned on the held-out calibration set. Zhang et al. [48] discussed three requirements that should be satisfied by post-hoc calibration methods. They should 1) preserve the accuracy, i.e. should not affect the predictor's performance, 2) be data efficient, i.e. only a small fraction of the training data set should be left out for the calibration, and 3) be able to approximate the correct re-calibration map as long as there is enough data available for calibration. Furthermore, they pointed out that none of the existing approaches fulfills all three requirements. For classification tasks, the most basic but still very efficient way of post-hoc calibration is temperature scaling [15]. For temperature scaling, the temperature T > 0 of the softmax function is optimized. For T = 1 the function remains the regular softmax function. For T > 1 the output changes such that its entropy increases, i.e. the predicted confidence decreases. For T ∈ (0, 1) the entropy decreases and, consequently, the predicted confidence increases. As already mentioned above, a perfectly calibrated neural network outputs MAP estimates. Since the learned transformation can only affect the uncertainty, log-likelihood based losses such as the cross-entropy do not have to be replaced by a special calibration loss. While the data efficiency and the preservation of the accuracy are given, the expressiveness of basic temperature scaling is limited [48]. To overcome this, Zhang et al. [48] investigated an ensemble of several temperature scaling models. Doing so, they achieved better calibrated predictions, while preserving the classification accuracy and improving the data efficiency and the expressive power. Kull et al. [284] were motivated by non-neural network calibration methods, where the calibration is performed class-wise as a one-vs-all binary calibration. They showed that this approach can be interpreted as learning a linear transformation of the predicted log-probabilities followed by a softmax function. This again is equivalent to training a dense layer on the log-probabilities, and hence the method is also very easy to implement and apply. Obviously, the original predictions are not guaranteed to be preserved. Analogous to temperature scaling for classification networks, Levi et al. [285] introduced standard deviation scaling (std-scaling) for regression networks. As the name indicates, the method is trained to rescale the predicted standard deviations of a given network. Analogously to optimizing temperature scaling with the cross-entropy loss, std-scaling can be trained using the Gaussian log-likelihood as loss, which is in general also used for the training of regression networks that predict the data uncertainty.
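Temperature scaling as described above can be sketched in a few lines. The sketch assumes that logits and integer class labels of a held-out calibration set are available, and it finds T by a simple grid search over the negative log-likelihood instead of a gradient-based optimizer; the grid range and all function names are illustrative choices.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Softmax with temperature T > 0 along the last axis."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nll(logits, labels, T):
    """Mean negative log-likelihood (cross-entropy) at temperature T."""
    p = softmax(logits, T)
    return float(-np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12)))

def fit_temperature(logits, labels, grid=np.linspace(0.05, 10.0, 400)):
    """Pick the temperature on the grid that minimizes the calibration-set NLL."""
    losses = [nll(logits, labels, T) for T in grid]
    return float(grid[int(np.argmin(losses))])

# Toy calibration set: the model is very confident but only 75% accurate.
logits = np.array([[10.0, 0.0]] * 4)
labels = np.array([0, 0, 0, 1])
T = fit_temperature(logits, labels)
print(T, softmax(logits[0], T))
```

Because dividing the logits by a positive constant does not change their ordering, the predicted class and hence the accuracy are preserved, while a fitted T > 1 lowers the over-confident probabilities.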
Wenger et al. [47] proposed a Gaussian process (GP) based method, which can be used to calibrate any multi-class classifier that outputs confidence values, and presented their methodology by calibrating neural networks. The main idea behind their work is to learn the calibration map by a Gaussian process that is trained on the network's confidence predictions and the corresponding ground truths in the held-out calibration set. For this approach, the preservation of the predictions is also not assured.
3) Calibration with Uncertainty Estimation Approaches: As already discussed above, removing the model uncertainty and obtaining an accurate estimation of the data uncertainty leads to a well-calibrated predictor. Consequently, several works based on deep ensembles [31], [275] and BNNs [272], [209], [204] also compared their performance to other methods based on the resulting calibration. Lakshminarayanan et al. [31] and Mehrtash et al. [275] reported an improved calibration by applying deep ensembles compared to single networks. However, Rahaman and Thiery [234] showed that for specific configurations, such as the usage of mixup regularization, deep ensembles can even increase the calibration error. On the other hand, they showed that applying temperature scaling on the averaged predictions can give a significant improvement of the calibration. For the Bayesian approaches, [204] showed that restricting the Bayesian approximation to the weights of the last fully connected layer of a DNN is already enough to improve the calibration significantly. Zhang et al. [273] and Laves et al. [274] showed that confidence estimates computed with MC dropout can be poorly calibrated. To overcome this, Zhang et al. [273] proposed structured dropout, which consists of dropping channels, blocks or layers, to promote model diversity and reduce calibration errors.

B. Evaluating Calibration Quality
Evaluating calibration consists of measuring the statistical consistency between the predictive distributions and the observations [286]. For classification tasks, several calibration measures are based on binning. For that, the predictions are ordered by the predicted confidence p̂_i and grouped into M bins b_1, ..., b_M. Subsequently, the calibration of the single bins is evaluated by setting the average bin confidence
$$\mathrm{conf}(b_m) = \frac{1}{|b_m|}\sum_{s \in b_m} \hat{p}_s$$
in relation to the average bin accuracy
$$\mathrm{acc}(b_m) = \frac{1}{|b_m|}\sum_{s \in b_m} \mathbb{1}(\hat{y}_s = y_s),$$
where ŷ_s, y_s and p̂_s refer to the predicted class label, the true class label and the predicted confidence of a sample s. As noted in [15], confidences are well-calibrated when for each bin acc(b_m) = conf(b_m). For a visual evaluation of a model's calibration, the reliability diagram introduced by [287] is widely used. For a reliability diagram, conf(b_m) is plotted against acc(b_m). For a well-calibrated model, the plot should be close to the diagonal, as visualized in Figure 8. The basic reliability diagram visualization does not distinguish between different classes. In order to do so and hence to improve the interpretability of the calibration error, Vaicenavicius et al. [286] used an alternative visualization named multidimensional reliability diagram.
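The binning procedure above can be sketched as follows, together with the expected calibration error (ECE) in its standard form, i.e. the bin-size weighted average of |acc(b_m) − conf(b_m)|. The ECE formula itself is not spelled out in this section, so it is stated here as the commonly used definition; the function names, the ten equal-width bins, and the half-open bin edges are assumptions of this sketch.

```python
import numpy as np

def bin_statistics(confidences, correct, n_bins=10):
    """Per-bin (count, mean confidence, accuracy) for equal-width bins on [0, 1]."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin m collects confidences in (edges[m], edges[m + 1]]; zero falls into bin 0.
    idx = np.digitize(conf, edges[1:-1], right=True)
    stats = []
    for m in range(n_bins):
        mask = idx == m
        if mask.any():
            stats.append((int(mask.sum()), float(conf[mask].mean()), float(corr[mask].mean())))
        else:
            stats.append((0, float("nan"), float("nan")))  # empty bin
    return stats

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-size weighted average of |acc(b_m) - conf(b_m)| over non-empty bins."""
    n = len(confidences)
    return float(sum(cnt / n * abs(acc - cf)
                     for cnt, cf, acc in bin_statistics(confidences, correct, n_bins)
                     if cnt > 0))

# Four predictions made with 90% confidence, of which three are correct.
conf = [0.9, 0.9, 0.9, 0.9]
correct = [1, 1, 1, 0]
print(expected_calibration_error(conf, correct))
```

Plotting the mean confidence of each non-empty bin against its accuracy gives the reliability diagram described above.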
For the ECE, only the predicted confidence score (top-label) is considered. In contrast to this, the Static Calibration Error (SCE) [288], [289] considers the predictions of all classes (all-labels). For each class, the SCE computes the calibration error within the bins and then averages across all classes, i.e.
$$\mathrm{SCE} = \frac{1}{K}\sum_{k=1}^{K}\sum_{m=1}^{M} \frac{|b_{m_k}|}{N}\left|\mathrm{conf}(b_{m_k}) - \mathrm{acc}(b_{m_k})\right|.$$
Here, conf(b_{m_k}) and acc(b_{m_k}) are the confidence and accuracy of bin b_m for class label k, respectively. Nixon et al. [288] empirically showed that all-labels calibration measures such as the SCE are more effective in assessing the calibration error than top-label calibration measures such as the ECE.
In contrast to the ECE and SCE, which group predictions into M equally spaced bins (which in general leads to different numbers of evaluation samples per bin), the adaptive calibration error [288], [289] adaptively groups predictions into R bins of different width but with an equal number of predictions. With this adaptive bin size, the adaptive Expected Calibration Error (aECE) and the adaptive Static Calibration Error (aSCE) are defined as extensions of the ECE and the SCE.
As has been empirically shown in [280] and [288], the adaptive binning calibration measures aECE and aSCE are more robust to the number of bins than the corresponding equal-width binning calibration measures ECE and SCE.
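The all-labels and adaptive variants can be sketched in the same style. This is a sketch under stated assumptions rather than the exact definitions of the cited works: the SCE bins each class' predicted probability into equal-width bins and averages the weighted per-bin errors over the classes, while the adaptive ECE sorts the top-label confidences and splits them into R bins of (nearly) equal size, weighting each bin by its share of samples. Function names are hypothetical.

```python
import numpy as np

def static_calibration_error(probs, labels, n_bins=10):
    """All-labels calibration error with equal-width bins.

    probs: [N, K] predicted class probabilities, labels: [N] integer classes.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    n, k = probs.shape
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for c in range(k):
        conf_c = probs[:, c]                     # predicted probability of class c
        is_c = (labels == c).astype(float)       # 1 where class c is the true label
        idx = np.digitize(conf_c, edges[1:-1], right=True)
        for m in range(n_bins):
            mask = idx == m
            if mask.any():
                total += mask.sum() / n * abs(is_c[mask].mean() - conf_c[mask].mean())
    return float(total / k)

def adaptive_ece(confidences, correct, n_bins=10):
    """Top-label calibration error with equal-mass (adaptive) bins."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    order = np.argsort(conf)                     # sort samples by confidence
    err = 0.0
    for chunk in np.array_split(order, n_bins):  # bins with (nearly) equal counts
        if len(chunk):
            err += len(chunk) / len(conf) * abs(corr[chunk].mean() - conf[chunk].mean())
    return float(err)

probs = np.array([[0.8, 0.2], [0.6, 0.4], [0.3, 0.7], [0.9, 0.1]])
labels = np.array([0, 1, 1, 0])
print(static_calibration_error(probs, labels, n_bins=5))
print(adaptive_ece(probs.max(axis=1), probs.argmax(axis=1) == labels, n_bins=2))
```

With equal-mass bins no bin is empty or dominated by a handful of samples, which is one intuition for why the adaptive measures react less strongly to the chosen number of bins.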
It is important to make clear that in a multi-class setting, the calibration measures can suffer from imbalance in the test data. Even when the calibration is computed class-wise, the computed errors are weighted by the number of samples in the classes. Consequently, larger classes can overshadow the bad calibration on small classes, comparable to accuracy values in classification tasks [290].

VI. DATA SETS AND BASELINES
In this section, we collect commonly used tasks and data sets for evaluating uncertainty estimation among existing works. In addition, a variety of baseline approaches commonly used for comparison against the methods proposed by the researchers are presented. By providing a review of the relevant information on these experiments, we hope that both researchers and practitioners can benefit from it. While the former can gain a basic understanding of recent benchmark tasks, data sets and baselines so that they can design appropriate experiments to validate their ideas more efficiently, the latter might use the provided information to select more relevant approaches to start with, based on a concise overview of the tasks and data sets on which an approach has been validated.
In the following, we introduce the data sets and baselines summarized in Table IV according to the taxonomy used throughout this review.
The structure of the table is designed to organize the main contents of this section concisely and to provide a clear overview of the relevant works. We group the approaches of each category into one of four blocks and extract the most commonly used tasks, data sets and provided baselines for each column. The corresponding literature is listed at the bottom of each block to facilitate lookup. Note that we focus on methodological comparison here, but not on the choice of architecture for different methods, which has an impact on performance as well. Due to space limitations and visual density, we only show the most important elements (task, data set, baselines), ranked according to the frequency of use in the literature we have reviewed.
The main results are as follows. Among the most frequent tasks for evaluating uncertainty estimation methods are regression tasks, in which samples close to and far away from the training distribution are studied. Furthermore, the calibration of uncertainty estimates for classification problems is very often investigated. Further noteworthy tasks are out-of-distribution (OOD) detection and robustness against adversarial attacks. In the medical domain, calibration of semantic segmentation results is the predominant use case.
The choice of data sets is mostly consistent among all reviewed works. For regression, toy data sets are employed to visualize uncertainty intervals, while the UCI data sets are studied with respect to (negative) log-likelihood comparison. The most common data sets for calibration and OOD detection are MNIST, CIFAR10 and CIFAR100, as well as SVHN, while ImageNet and its tiny variant are also studied frequently. When OOD detection is studied, these form distinct pairs: models trained on CIFAR variants are evaluated on SVHN and vice versa, while MNIST is paired with variants of itself such as notMNIST and FashionMNIST. Classification data sets are also commonly distorted and corrupted to study the effects on calibration, blurring the line between OOD detection and adversarial attacks.
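For such in-distribution/OOD pairs, a common way to score a detector is the area under the ROC curve (AUROC) of an uncertainty score. The following minimal sketch assumes that softmax outputs for both data sets are already available and uses one minus the maximum softmax probability as the score. The function names and the toy probabilities are hypothetical and do not come from any of the reviewed works.

```python
import numpy as np

def max_softmax_uncertainty(probs):
    """Uncertainty score: one minus the largest class probability per sample."""
    return 1.0 - np.max(probs, axis=1)

def auroc(scores_in, scores_out):
    """Probability that a random OOD sample receives a higher uncertainty score
    than a random in-distribution sample (ties count one half)."""
    scores_in = np.asarray(scores_in, dtype=float)
    scores_out = np.asarray(scores_out, dtype=float)
    diff = scores_out[:, None] - scores_in[None, :]  # all OOD/in-distribution pairs
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

# Hypothetical softmax outputs of one model on the two data sets of a pair.
probs_in = np.array([[0.9, 0.05, 0.05], [0.8, 0.1, 0.1], [0.6, 0.3, 0.1]])
probs_out = np.array([[0.4, 0.3, 0.3], [0.7, 0.2, 0.1]])

score = auroc(max_softmax_uncertainty(probs_in), max_softmax_uncertainty(probs_out))
print(score)
```

In the toy example, five of the six OOD/in-distribution pairs are ordered correctly, so the AUROC is 5/6. A value of 0.5 would correspond to a score that cannot separate the two data sets.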
Finally, the most commonly used baselines by far are Monte Carlo (MC) Dropout and deep ensembles, while the softmax output of deterministic models is almost always employed as a kind of surrogate baseline. It is interesting to note that within each approach (BNNs, Ensembles, Single Deterministic Models, and Input Augmentation) some baselines are preferred over others. BNNs are most frequently compared against variational inference methods such as Bayes by Backprop (BBB) or Probabilistic Backpropagation (PBP), while Single Deterministic Models are more commonly compared against distance-based methods in the case of OOD detection. Overall, BNN methods are evaluated on a more diverse set of tasks but less frequently on large data sets such as ImageNet.
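Both MC Dropout and deep ensembles yield several softmax outputs per input, which are then aggregated. The following sketch assumes that these M outputs have already been collected in an array of shape (M, N, K) and summarizes them by the mean prediction, the predictive entropy, and the mutual information. The function names and the toy values are hypothetical, and the sketch does not reproduce any specific implementation from the baselines listed above.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy in nats; eps avoids log(0)."""
    return -np.sum(p * np.log(p + eps), axis=axis)

def summarize_samples(probs):
    """probs: array of shape (M, N, K) holding the softmax outputs of M
    stochastic forward passes (MC Dropout) or M ensemble members."""
    mean_probs = probs.mean(axis=0)               # averaged prediction, (N, K)
    total = entropy(mean_probs)                   # predictive entropy, (N,)
    expected = entropy(probs).mean(axis=0)        # mean entropy of the members
    mutual_info = total - expected                # disagreement between members
    return mean_probs, total, mutual_info

# Two members, two inputs: the members agree on input 0 and disagree on input 1.
probs = np.array([
    [[0.5, 0.5], [1.0, 0.0]],
    [[0.5, 0.5], [0.0, 1.0]],
])
mean_probs, total, mutual_info = summarize_samples(probs)
print(mean_probs, total, mutual_info)
```

Both inputs receive the same averaged prediction and a predictive entropy close to ln 2 ≈ 0.693, but only the second input has a large mutual information, since there the members disagree with each other.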
To further facilitate access for practitioners, we provide web links to the authors' official implementations (marked by a star) of all common baselines identified in the baselines column. Where no official implementation is provided, we instead link to the highest-ranked implementations found on GitHub at the time of this survey. The list can also be found in our GitHub repository on available implementations.

VII. APPLICATIONS OF UNCERTAINTY ESTIMATES
From a practical point of view, the main motivation for quantifying uncertainties in DNNs is to be able to classify the received predictions and to make more confident decisions. This section gives a brief overview and examples of these motivations. In the first part, we discuss how uncertainty is used within active learning and reinforcement learning. Subsequently, we discuss the interest of communities working in domains such as medical image analysis, robotics, and earth observation. These application fields are representative of the large number of domains in which uncertainty quantification plays an important role. The challenges and concepts could (and should) be transferred to any application domain of interest.
Fig. 10: The active learning framework: The acquisition function evaluates the uncertainties on the network's test predictions in order to select unlabeled data. The selected data are labeled and added to the pool of labeled data, which is used to train and improve the performance of the predictor.

1) Active Learning: The process of collecting labeled data for supervised training of a DNN can be laborious, time-consuming, and costly. To reduce the annotation effort, the active learning framework shown in Figure 10 trains the DNN sequentially on different labeled data sets increasing in size over time [292]. In particular, given a small labeled data set and a large unlabeled data set, a deep neural network trained in the setting of active learning learns from the small labeled data set and decides, based on the acquisition function, which samples to select from the pool of unlabeled data. The selected data are added to the training data set and a new DNN is trained on the updated training data set. This process is then repeated, with the training set increasing in size over time. Uncertainty sampling is one of the most popular criteria used in acquisition functions [293]: the predictive uncertainty determines which candidate samples have the highest uncertainty and should be labeled next. Uncertainty-based active learning strategies for deep learning applications have been successfully used in several works [23], [24], [294], [25], [26].
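The uncertainty sampling step of the loop described above can be sketched as follows. The sketch assumes that the current network has already produced softmax outputs for the unlabeled pool and uses the predictive entropy as the acquisition function, which is one possible choice among many. The function names and the toy probabilities are hypothetical.

```python
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    """Entropy of each row of class probabilities; eps avoids log(0)."""
    return -np.sum(probs * np.log(probs + eps), axis=1)

def select_most_uncertain(probs, k):
    """Indices of the k pool samples with the highest predictive entropy,
    ordered from most to least uncertain."""
    scores = predictive_entropy(probs)
    return np.argsort(-scores)[:k]

# Hypothetical softmax outputs of the current model on four unlabeled samples.
pool_probs = np.array([
    [0.98, 0.01, 0.01],  # confident prediction
    [0.40, 0.35, 0.25],  # uncertain
    [0.70, 0.20, 0.10],  # moderately confident
    [0.34, 0.33, 0.33],  # almost uniform, most uncertain
])

query = select_most_uncertain(pool_probs, k=2)
print(query)  # samples 3 and 1 would be sent to the annotator
```

In a full loop, the queried samples would be labeled, moved from the pool to the training set, and the network retrained before the next selection round.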
Fig. 11: The reinforcement learning framework: The agent interacts with the environment by executing a specific action influencing the next state of the agent. The agent observes a reward representing the cost associated with the executed action. The agent chooses actions based on a policy learned by a deep neural network. However, the predicted uncertainty associated with the action predicted by the deep neural network can help the agent to decide whether to execute the predicted action or not.

2) Reinforcement Learning: The general framework of deep reinforcement learning is shown in Figure 11. In the context of reinforcement learning, uncertainty estimates can be used to address the exploration-exploitation dilemma. Specifically, uncertainty estimates can be used to effectively balance the exploration of unknown environments with the exploitation of existing knowledge extracted from known environments. For example, if a robot interacts with an unknown environment, the robot can safely avoid catastrophic failures by reasoning about its uncertainty. To estimate the uncertainty in this framework, Huang et al. [27] used an ensemble of bootstrapped models (models trained on different data sets sampled with replacement from the original data set), while Gal and Ghahramani [20] approximated Bayesian inference via dropout sampling. Inspired by [20] and [27], Kahn et al. [28] and Lötjens et al. [29] used a mixture of deep Bayesian networks performing dropout sampling on an ensemble of bootstrapped models. For further reading, Ghavamzadeh et al. [295] presented a survey of Bayesian reinforcement learning.
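The two ingredients mentioned above, bootstrapped training sets and an uncertainty-based decision on whether to execute an action, can be sketched as follows. This is an illustrative assumption and not the procedure of [27], [28], or [29]: the gating rule (standard deviation of the members' predictions compared against a threshold), the threshold value, and all names are hypothetical.

```python
import numpy as np

def bootstrap_indices(n_samples, n_members, rng):
    """One index set per ensemble member, each drawn with replacement
    from the original data set of size n_samples."""
    return rng.integers(0, n_samples, size=(n_members, n_samples))

def should_execute(member_predictions, threshold):
    """Hypothetical gating rule: execute the proposed action only if the
    ensemble members agree, i.e. their predictions have a small spread."""
    return float(np.std(member_predictions)) <= threshold

rng = np.random.default_rng(0)
idx = bootstrap_indices(n_samples=100, n_members=5, rng=rng)
print(idx.shape)  # (5, 100): five resampled training sets

# Hypothetical collision probabilities predicted by three ensemble members.
print(should_execute([0.10, 0.12, 0.11], threshold=0.05))  # members agree
print(should_execute([0.10, 0.90, 0.50], threshold=0.05))  # members disagree
```

Each row of `idx` would be used to train one ensemble member. When the members disagree strongly, the agent could fall back to a cautious default action instead of executing the proposed one.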

A. Uncertainty in Real-World Applications
With the increasing usage of deep learning approaches in many different fields, quantifying and handling uncertainties has become more and more important. On the one hand, uncertainty quantification plays an important role in risk minimization, which is needed in many application fields. On the other hand, many fields offer only challenging data sources, which are hard to control and verify. This makes the generation of trustworthy ground truth a very challenging task. In the following, three different fields in which uncertainty plays an important role are presented, namely medical image analysis, robotics, and earth observation.
2) Robotics: Robots are active agents that perceive, decide, plan, and act in the real world, all based on their incomplete knowledge about the world. As a result, mistakes of robots not only cause failures of their own mission but can also endanger human lives, e.g. in the case of surgical robotics, self-driving cars, space robotics, etc. Hence, the robotics application of deep learning poses unique research challenges that differ significantly from those often addressed in computer vision and other off-line settings [299]. For example, the assumption that the testing conditions come from the same distribution as the training conditions is often invalid in robotics, resulting in a deterioration of the performance of DNNs in uncontrolled and detrimental conditions. This raises the question of how we can quantify the uncertainty in a DNN's predictions in order to avoid catastrophic failures. Answering such questions is important in robotics, as it might be a lofty goal to expect data-driven approaches (in many aspects from control to perception) to always be accurate. Instead, reasoning about uncertainty can help in leveraging the recent advances in deep learning for robotics.
Reasoning about uncertainties and the use of probabilistic representations, as opposed to relying on a single, most-likely estimate, have been central to many domains of robotics research, even before the advent of deep learning [300]. In robot perception, several uncertainty-aware methods have been proposed in the past, ranging from localization methods [301], [302], [303] to simultaneous localization and mapping (SLAM) frameworks [304], [305], [306], [307]. As a result, many probabilistic methods such as factor graphs [308], [309] are now the workhorse of advanced consumer products such as robotic vacuum cleaners and unmanned aerial vehicles. In planning and control, estimation problems are widely treated as Bayesian sequential learning problems, and sequential decision-making frameworks such as POMDPs [310], [311] assume a probabilistic treatment of the underlying planning problems. With probabilistic representations, many reinforcement learning algorithms are backed by stability guarantees for safe interactions in the real world [312], [313], [314]. Lastly, there have also been several advances ranging from reasoning (semantics [315] to joint reasoning with geometry) and embodiment (e.g. active perception [316]) to learning (e.g. active learning [317], [318], [319] and identifying unknown objects [320], [321], [322]).
Similarly, with the advent of deep learning, many researchers have proposed new methods to quantify the uncertainty in deep learning as well as ways to further exploit such information. As opposed to the many generic approaches, we summarize task-specific methods and their application in practice in the following. Notably, [323] proposed to perform novelty detection using auto-encoders, where the reconstructed outputs of the auto-encoders were used to decide how much one can trust the network's predictions. Peretroukhin et al. [324] developed an SO(3) representation and uncertainty estimation framework for rotational learning problems. [325], [28], [326], [327] demonstrated uncertainty-aware, real-world applications of reinforcement learning algorithms for robotics, while [328], [329] proposed to leverage spatial information on top of MC dropout. [207], [330], [331] developed deep learning based localization systems along with uncertainty estimates. Other approaches learn from the robots' past experiences of failures or detect inconsistencies of the predictors [332], [333]. In summary, the robotics community has been both a user and a developer of uncertainty estimation frameworks targeted to specific problems.
Yet, robotics poses several unique challenges to uncertainty estimation methods for DNNs. These are, for example, (i) how to limit the computational burden and build real-time capable methods that can be executed on robots with limited computational capacities (e.g. aerial robots, space robots, etc.); (ii) how to leverage spatial and temporal information, as robots sense sequentially instead of having a batch of training data for uncertainty estimates; (iii) whether robots can select the most uncertain samples and update their learner online; and (iv) whether robots can purposefully manipulate the scene when uncertain. Most of these challenges arise from the fact that robots are physically situated systems.
3) Earth Observation (EO): Earth Observation (EO) systems are increasingly used to make critical decisions related to urban planning [334], resource management [335], disaster response [336], and many more. Right now, there are hundreds of EO satellites in space, owned by different space agencies and private companies. Figure 12 shows the satellites owned by the European Space Agency (ESA). As in many other domains, deep learning has shown great initial success in the field of EO over the past few years [337]. These early successes consisted of taking the latest developments of deep learning in computer vision and applying them to small curated earth observation data sets [337]. At the same time, the underlying data is very challenging. Even though the amount of data is huge, so is its variability. This variability is caused by different sensor types, spatial changes (e.g. different regions and resolutions), and temporal changes (e.g. changing light conditions, weather conditions, and seasons). Besides the challenge of finding efficient uncertainty quantification methods for such large amounts of data, several other challenges that can be tackled with uncertainty quantification exist in the field of EO. All in all, the sensitivity of many EO applications, together with the nature of EO systems and the challenging EO data, makes the quantification of uncertainties very important in this field. Despite hundreds of publications on DL for EO in recent years, the range of literature on measuring the uncertainties of these systems is relatively small.

Furthermore, due to the large variation in the data, a data sample received at test time is often not covered by the training data distribution. For example, while preparing training data for local climate zone classification, the human experts might be presented only with images in which there is no obstruction and structures are clearly visible. When a model trained on this data set is deployed in the real world, it might see images with clouds obstructing the structures or snow giving them a completely different look. Also, the classes in EO data can have a very wide distribution. For example, there are millions of types of houses in the world, and no training data can contain examples of all of them. The question is where the OOD detector should draw the line and declare further houses as OOD. Hence, OOD detection is important in earth observation, and uncertainty measurements play an important part in this [22].
Another common task in EO in which uncertainties can play an important role is data fusion. Optical images normally contain only a few channels, such as RGB. In contrast, EO data can contain optical images with up to hundreds of channels and a variety of different sensors with different spatial, temporal, and semantic properties. Fusing the information from these different sources and channels propagates the uncertainties from the different sources onto the prediction. The challenge lies in developing methods that not only quantify uncertainties but also quantify the contribution of the individual channels and that learn to focus on the trustworthy data source for a given sample [338].
Unlike normal computer vision scenarios, where the image acquisition equipment is quite near to the subject, EO satellites are hundreds of kilometers away from the subject. The sensitivity of the sensors, the atmospheric absorption properties, and the surface reflectance properties all contribute to uncertainties in the acquired data. Integrating the knowledge of physical EO systems, which also contains information about uncertainty models in those systems, is another major open issue. However, for several applications in EO, measuring uncertainties is not merely desirable but rather an important requirement of the field. For example, the geo-variables derived from EO data may be assimilated into process models (ocean, hydrological, weather, climate, etc.), and the assimilation requires the probability distribution of the estimated variables.

A. Conclusion: How well do the current uncertainty quantification methods work for real world applications?
Even though many advances in uncertainty quantification in neural networks have been made over the last years, their adoption in practical mission- and safety-critical applications is still limited. There are several reasons for this, which are discussed one by one in the following:
• Missing Validation of Existing Methods over Real-World Problems: Although DNNs have become the de facto standard for solving numerous computer vision and medical image processing tasks, the majority of existing models are not able to appropriately quantify the uncertainty inherent to their inferences, particularly in real-world applications. This is primarily because the baseline models are mostly developed using standard data sets such as CIFAR10/100, ImageNet, or well-known regression data sets that are specific to a particular use case and are therefore not readily applicable to complex real-world environments, for example low-resolution satellite data or other data sources affected by noise. Although many researchers from other fields apply uncertainty quantification in their field [21], [10], [8], a broad and structured evaluation of existing methods based on different real-world applications is not available yet. Works like [56] have already taken first steps towards a real-life evaluation.
• Lack of Standardized Evaluation Protocol: Existing methods for evaluating the estimated uncertainty are best suited to comparing uncertainty quantification methods based on measurable quantities such as the calibration [340] or the performance on out-of-distribution detection [32]. As described in Section VI, these tests are performed on standardized sets within the machine learning community. Nevertheless, the experimental settings may differ from paper to paper [214], and a clear standardized protocol of tests that should be performed on uncertainty quantification methods is still not available. For researchers from other domains, it is difficult to directly find state-of-the-art methods for the field they are interested in, not to speak of the hard decision on which sub-field of uncertainty quantification to focus. This makes the direct comparison of the latest approaches difficult and also limits the acceptance and adoption of existing methods for uncertainty quantification.
• Inability to Evaluate Uncertainty Associated to a Single Decision: Existing measures for evaluating the estimated uncertainty (e.g., the expected calibration error) are based on the whole testing data set. This means that, as in classification tasks on unbalanced data sets, the uncertainty associated with single samples or small groups of samples may become biased towards the performance on the rest of the data set. For practical applications, however, assessing the reliability of an individual predicted confidence would offer far more possibilities than an aggregated reliability based on some testing data that are independent of the current situation [341]. Especially for mission- and safety-critical applications, pointwise evaluation measures could be of paramount importance, and hence such evaluation approaches are very desirable.
• Lack of Ground Truth Uncertainties: Current methods are evaluated empirically, and their performance is supported by reasonable and explainable values of uncertainty. A ground truth uncertainty that could be used for validation is in general not available. Additionally, even though existing methods are calibrated on given data sets, one cannot simply transfer these results to any other data set, since one has to be aware of shifts in the data distribution and of the fact that many fields can only cover a tiny portion of the actual data environment. In application fields such as EO, the preparation of a huge amount of training data is hard and expensive, and hence synthetic data can be used to train a model. For this artificial data, artificial uncertainties in labels and data should be taken into account to obtain a better understanding of the uncertainty quantification performance. The gap between real and synthetic data, or between estimated and real uncertainty, further limits the adoption of currently existing methods for uncertainty quantification.
• Explainability Issue: Besides the explainability of neural network decisions, existing methods for uncertainty quantification are not well understood on a higher level. For instance, explaining the behavior of single deterministic approaches, ensembles, or Bayesian methods is a current direction of research and remains difficult to grasp in every detail [227]. It is, however, crucial to understand how these methods operate and capture uncertainty in order to identify pathways for refinement and to detect and characterize uncertainty, failures, and important shortcomings [227].

B. Outlook
• Generic Evaluation Framework: As discussed above, there are still problems regarding the evaluation of uncertainty methods, such as the lack of 'ground truth' uncertainties, the inability to test on single instances, and the lack of standardized benchmarking protocols. To cope with such issues, the provision of an evaluation protocol containing various concrete baseline data sets and evaluation metrics that cover all types of uncertainty would undoubtedly help to boost research in uncertainty quantification. The evaluation with regard to risk-averse and worst-case scenarios should also be considered there. This means that predictions with a very high predicted certainty should never fail, as for example for a prediction of a red or green traffic light. Such a general protocol would enable researchers to easily compare different types of methods against an established benchmark as well as on real-world data sets. The adoption of such a standard evaluation protocol should be encouraged by conferences and journals.

• Expert & Systematic Comparison of Baselines
A broad and structured comparison of existing methods for uncertainty estimation on real-world applications is not available yet. An evaluation on real-world data is not even standard in current machine learning research papers. As a result, given a specific application, it remains unclear which method for uncertainty estimation performs best and whether the latest methods also outperform older methods on real-world examples. This is partly caused by the fact that researchers from other domains who use uncertainty quantification methods generally present successful applications of single approaches on the specific problem or data set at hand. Considering this, there are several points that could be adopted for a better comparison within the different research domains. For instance, domain experts should also compare different approaches against each other and present the weaknesses of single approaches in their domain. Similarly, for a better comparison among several domains, the works from the different real-world domains could be collected and exchanged on a central platform. Such a platform might also help machine learning researchers by providing an additional source of real-world challenges and would pave the way to broadly highlighting weaknesses in the current state-of-the-art approaches. Google's repository on baselines in uncertainties in neural networks [340] could be such a platform and a step towards achieving this goal.

• Uncertainty Ground Truths
It remains difficult to validate existing methods due to the lack of uncertainty ground truths. An actual uncertainty ground truth on which methods can be compared in an ImageNet-like manner would make the evaluation of predictions on single samples possible. To reach this, the data generation process and the sources of uncertainty that occur in it, for example the labeling process, might be investigated in more detail.

• Explainability and Physical Models
Knowing the actual reasons for a false high certainty or a low certainty makes it much easier to engineer the methods for real-life applications, which in turn increases people's trust in such methods. Recently, Antorán et al. [342] claimed to have published the first work on explainable uncertainty estimation. Uncertainty estimations, in general, form an important step towards explainable artificial intelligence. Explainable uncertainty estimations would give an even deeper understanding of the decision process of a neural network, which, in the practical deployment of DNNs, should incorporate the desired ability to be risk averse while staying applicable in the real world (especially in safety-critical applications). Also, the possibility of improving explainability with physically based arguments offers great potential. While DNNs are very flexible and efficient, they do not directly embed the domain-specific expert knowledge that is mostly available and can often be described by mathematical or physical models, as, for example, in earth system science problems [343]. Such physics-guided models offer a variety of possibilities to include explicit knowledge as well as practical uncertainty representations in a deep learning framework [344], [345].
and (8) for a new sample $x^*$ are then predicted based on the known examples by
$$p(y^* \mid x^*) = p(y^* \mid D, x^*) \quad (9)$$
and
$$y^* = \arg\max_y \, p(y \mid D, x^*) .$$

Fig. 2: The illustration shows the different steps of a neural network pipeline, based on the earth observation example of land cover classification (here settlement and forest) from optical images. The different factors that affect the predictive uncertainty are highlighted in the boxes. Factor I is shown as changing environments by cloud-covered trees and different types and colors of trees. Factor II is shown by insufficient measurements that cannot directly be used to separate between settlement and forest, and by label noise. In practice, the resolution of such images can be low, which would also be part of Factor II. Factor III and Factor IV represent the uncertainties caused by the network structure and the stochastic training process, respectively. Factor V, in contrast, is represented by feeding the trained network with unknown types of images, namely cows and pigs.

Fig. 4: A visualization of the basic principles of uncertainty modeling for the four presented general types of uncertainty prediction in neural networks. For a given input sample $x^*$, each approach delivers a prediction $y^*$, a representation of model uncertainty $\sigma_{model}$, and a value of data uncertainty $\sigma_{data}$: A) single deterministic model, B) Bayesian neural network, C) ensemble approach, and D) test-time data augmentation. The mean and the standard deviation are only used to keep the visualization simple. In practice, other methods could be utilized. For the deterministic approaches, the idea of predicting the parameters of a probability distribution Ξ is visualized; other approaches that rely on tools in addition to the prediction network are not visualized here.

Fig. 5: Predictions received from a LeNet network trained on MNIST's handwritten digits from 0 to 9 and evaluated on different rotations of test samples. One can clearly see that for some rotations the network gives a high confidence on the false class due to confusion (e.g. 3 is confused with 8) or representations not seen at training. These examples represent a simple case of how a basic classification network can lead to overconfident wrong predictions under data distribution shifts.

Fig. 6: The desired behaviors of a Dirichlet distribution over categorical distributions. The visualizations show three Dirichlet distributions over three classes. Each node of the simplex represents one class. In (a) the sharp Dirichlet distribution with its expectation close to the upper node represents a certain prediction of a categorical distribution. In (b) the sharp Dirichlet distribution in the center of the simplex represents high data uncertainty but low distributional uncertainty. In (c) the flat Dirichlet distribution indicates high distributional uncertainty.

Fig. 8: (a) Reliability diagram showing an overconfident classifier: The bin-wise accuracy is smaller than the corresponding confidence. (b) Reliability diagram of an underconfident classifier: The bin-wise accuracy is larger than the corresponding confidence. (c) Reliability diagram of a well calibrated classifier: The confidence fits the actual accuracy for the single bins.
where $L_b(\theta)$ is the original unregularized loss using training and mixed samples included in batch $b$, and $\beta$ is a hyperparameter controlling the relative importance given to the batch-wise expected calibration error $\mathrm{ECE}_b$. By adding the batch-wise calibration error for each batch $b \in B$ to the standard loss function, the miscalibration induced by mixup training is regularized. In the context of data augmentation, Patel et al. [280] improved the calibration of uncertainty estimates by using on-manifold data augmentation. While mixup training combines training samples, on-manifold adversarial training generates out-of-domain samples using adversarial attacks. They experimentally showed that on-manifold adversarial training outperforms mixup training in improving the calibration.

Similar to s = y s ) ,

For the ECE, $M$ equally-spaced bins $b_1, \dots, b_M$ are considered, where $b_m$ denotes the set of indices of samples whose confidences fall into the interval $I_m = \left]\frac{m-1}{M}, \frac{m}{M}\right]$. The ECE is then computed as the weighted average of the bin-wise calibration errors, i.e.
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|b_m|}{N} \left| \mathrm{acc}(b_m) - \mathrm{conf}(b_m) \right| .$$
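The ECE computation described above can be sketched in a few lines. The sketch assumes that N denotes the total number of samples, that the confidence of a sample is the probability assigned to its predicted class, and that empty bins contribute nothing. The function name and the toy values are hypothetical.

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, num_bins=10):
    """Weighted average of |acc(b_m) - conf(b_m)| over M equally spaced bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    n = len(confidences)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for m in range(num_bins):
        # Samples whose confidence lies in the half-open interval ]edges[m], edges[m+1]]
        in_bin = (confidences > edges[m]) & (confidences <= edges[m + 1])
        if not np.any(in_bin):
            continue  # empty bins contribute nothing
        acc = correct[in_bin].mean()       # bin-wise accuracy
        conf = confidences[in_bin].mean()  # bin-wise mean confidence
        ece += (in_bin.sum() / n) * abs(acc - conf)
    return float(ece)

# Toy example: four samples at confidence 0.95 (three correct) and
# two samples at confidence 0.55 (one correct).
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
predictions = [1, 1, 1, 0, 2, 2]
labels = [1, 1, 1, 1, 2, 0]

ece = expected_calibration_error(confidences, predictions, labels)
print(round(ece, 4))  # (4/6) * 0.20 + (2/6) * 0.05, i.e. about 0.15
```

The same function applied to the samples of a single batch would give a batch-wise quantity in the spirit of $\mathrm{ECE}_b$, although a hard-binned value like this is not differentiable and would need a suitable treatment before being used inside a training loss.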

Table I
Visualization of the four different types of uncertainty quantification methods presented in this paper.

TABLE I: An overview of the four general methods presented in this paper, namely Bayesian Neural Networks, Ensembles, Single Deterministic Neural Networks, and Test-Time Data Augmentation. The labels high and low are given relative to the other approaches and based on the general idea behind them.

TABLE III: Overview of the properties of different types of Bayesian neural network approaches. The properties are stated relative to each other among these approaches. They cannot be used for comparison with other uncertainty methods such as ensembles, single deterministic models, and test-time augmentation methods. For a comparison of Bayesian methods with these methods, see Table I.
High - M forward passes based on sampled parameters, otherwise intractable | Low - Training of one deterministic model and Laplace approximation
Computational effort at inference: High - M forward passes based on sampled parameters, otherwise intractable | High - M forward passes based on sampled parameters, otherwise intractable | High - M forward passes based on sampled parameters, otherwise intractable

• Random Initialization and Data Shuffle: Due to the highly non-linear loss landscape, different initializations of a neural network generally lead to different training results. Since the training is realized on mini-batches, the order of the training data points also affects the final result.
[62] original set. Bagging is sampling from the training data uniformly and with replacement [62]. Thanks to the replacement process, ensemble members can see single samples several times in the training set while missing some other training samples. For boosting, the members are trained one after another and the probability of sampling a sample for the next training set is based

, the label smoothing resulting from mixup training can be viewed as a form of entropy-based regularization resulting in inherent calibration of networks trained