1 Introduction

In classical machine learning, predictions are usually expressed as point estimates: the learned model simply returns a single value to the user without indicating how confident it is in that prediction. In many use cases, this is acceptable. For example, for movie recommendations on streaming platforms or book recommendations based on books already read, a single wrong or poor decision usually has no severe consequences. The situation is different, however, in critical application areas of artificial intelligence (AI). In medicine, autonomous driving, or quality testing in industrial production, the financial risks, but especially the impact on humans, are significantly greater. Here, a desirable behavior of the AI would be to provide a confidence level for its decision or, in the case of very uncertain decisions, to signal to the user: “I don’t know” or “I am uncertain”.

Developing new methods with this functionality and extending existing methods with a probability component is the endeavor of the research field of probabilistic machine learning. In recent years, the focus has been particularly on deep neural networks (NNs), as they have become the gold standard for many problems, especially in supervised learning. Quantifying the uncertainty of a machine learning model allows the user to assess the quality of a prediction and whether its confidence is sufficient for the particular use case. Thus, the user is not left alone with the algorithm’s decision, but is provided with additional information to help them evaluate this decision and take further action if necessary. Such actions can include, for example, retraining with new data if the learning algorithm exhibits high uncertainty in general, or deferring a specific decision to human review if the algorithm’s decision seems too risky. Uncertainty quantification of machine learning algorithms, and of NNs in particular, can thus make an important contribution to digital sovereignty.

The goal of this work is to motivate the quantification of uncertainty for NNs and to present different methods that make this possible in practice. To this end, the different types of uncertainty and the need for approximate methods are first discussed. Afterwards, the Bayesian NN (BNN) is introduced, which is a principled way to realize probabilistic machine learning. Sects. 3.1–3.6 deal with methods for approximately calculating the posterior probabilities of BNNs. Finally, we discuss the future of the research field and how it can contribute to digital sovereignty.

2 Probabilistic Machine Learning

2.1 Basic Principles

In supervised learning, we always consider a dataset \(\mathcal {D} = \{ (\textbf{x}_i, \textbf{y}_i) \}_{i=1}^N\) with N inputs \(\textbf{x}\) and outputs \(\textbf{y}\). While in classical machine learning a function f with parameters \(\theta \) is to be learned with \(f_\theta (\textbf{x}_i) = \textbf{y}_i\), in probabilistic machine learning the output or predictive probability distribution \(p(\textbf{y}|\textbf{x}, \mathcal {D})\) is to be learned. \(p(\textbf{y}|\textbf{x}, \mathcal {D})\) is a conditional probability distribution that gives a probability of output \(\textbf{y}\) based on the training data \(\mathcal {D}\) and the current data point \(\textbf{x}\). In order to calculate \(p(\textbf{y}|\textbf{x}, \mathcal {D})\), the parameters of the machine learning model used are also assumed to be probabilistic. By integrating over the probability distribution of the network parameters, \(p(\textbf{y}|\textbf{x}, \mathcal {D}) = \int p(\textbf{y}|\textbf{x},\theta )\cdot p(\theta |\mathcal {D}){\text {d}}\!\,\theta \) can be used to compute the probability distribution of the output \(\textbf{y}\).

Intuitively, the parameter \(\theta \) indexes all possible realizations of an NN via the probability distribution \(p(\textbf{y}|\textbf{x},\theta )\). The outputs of all possible NNs are averaged, with each NN weighted by \(p(\theta |\mathcal {D})\), the probability of a particular parameter realization given the training data. Thus, the flexibility and variability in the choice of model parameters are taken into account when calculating the output probability distribution.
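Since this integral is rarely tractable, it is typically approximated by Monte Carlo sampling over the parameters. The following minimal Python sketch illustrates the idea; the callable sample_model, which returns a network whose parameters are drawn from (an approximation of) \(p(\theta |\mathcal {D})\), is a hypothetical placeholder.

```python
import torch

def predictive_distribution(x, sample_model, n_samples=100):
    # Monte Carlo approximation of p(y|x, D): average the outputs of
    # networks whose parameters are drawn from p(theta|D).
    outputs = torch.stack([sample_model()(x) for _ in range(n_samples)])
    mean = outputs.mean(dim=0)  # final prediction
    std = outputs.std(dim=0)    # spread reflects (epistemic) uncertainty
    return mean, std
```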

In principle, two types of uncertainty must be distinguished in this context. Aleatoric uncertainty describes the intrinsic uncertainty within the data, often also referred to as measurement noise. Epistemic uncertainty describes the lack of model knowledge, which in our case is reflected in the probability distribution over the model parameters. For example, based on the available data, different parameter configurations can lead to very similar prediction results. The problem, however, is that the distribution \(p(\theta |\mathcal {D})\) can usually only be calculated approximately: for a large number of parameters \(\theta \), a high-dimensional integration arises that can be solved exactly only in exceptional cases. Often, it is therefore assumed that all relevant quantities are normally distributed in order to simplify the problem.

2.2 Bayesian Neural Networks

If NNs are extended by a probability distribution over their weights, they are called BNNs (see Fig. 1). The concept of BNNs has existed for several decades [1], but has gained renewed attention in recent years due to the popularity of deep NNs. In general, a BNN is characterized not only by model parameters \(\theta \), i.e., the weights, but also by a probability distribution \(p(\theta )\) over all weights. After successful training, this distribution depends on the learned data \(\mathcal {D}\) and becomes the posterior distribution \(p(\theta |\mathcal {D})\). During the training process, the famous Bayes’ rule is used to process the information about the training data in order to adjust the distribution \(p(\theta )\). Bayes’ rule can be formulated for a BNN as

$$\begin{aligned} p(\theta |\mathcal {D}) = \frac{p(\textbf{Y}|\textbf{X}, \theta )\cdot p(\theta )}{p(\textbf{Y}|\textbf{X})} \end{aligned}$$
(1)

with the training data \(\textbf{X} \triangleq [\textbf{x}_1\, \ldots \, \textbf{x}_N]\), \(\textbf{Y} \triangleq [\textbf{y}_1\, \ldots \, \textbf{y}_N]\). Here \(p(\theta )\) denotes the prior distribution, which is assumed as the initial distribution of the weights. It can either encode information and prior knowledge about the problem or be chosen as general and uninformative as possible. \(p(\textbf{Y}|\textbf{X}, \theta )\) is called the likelihood. It describes how well the training data can be modeled with the given model parameters \(\theta \). \(p(\textbf{Y}|\textbf{X})\) is the so-called evidence. It serves as a normalization factor and describes the overall probability of the training data independent of the parameter choice. The distribution over the model parameters is adjusted using Eq. (1) as new data becomes available. In general, \(p(\textbf{Y}|\textbf{X})\) cannot be calculated exactly, which is why Eq. (1) can only be solved approximately in practice. The following section deals with different methods for the approximate calculation of the posterior distribution of the parameters of a BNN based on the available data.
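For intuition, Eq. (1) can be evaluated in closed form in simple conjugate cases. The following toy sketch, for a single weight w with prior \(\mathcal {N}(0, 1)\) and a linear-Gaussian likelihood \(y_i \sim \mathcal {N}(w x_i, \sigma ^2)\), shows such an exact update; all values are illustrative.

```python
import numpy as np

# Exact Bayes update (Eq. (1)) for a toy model: a single weight w with
# prior N(0, 1) and likelihood y_i ~ N(w * x_i, sigma^2). Conjugacy makes
# the posterior Gaussian, so no approximation is needed here.
def posterior_weight(x, y, sigma=0.5):
    prior_prec = 1.0                        # precision of the prior N(0, 1)
    lik_prec = np.dot(x, x) / sigma**2      # precision contributed by the data
    post_var = 1.0 / (prior_prec + lik_prec)
    post_mean = post_var * np.dot(x, y) / sigma**2
    return post_mean, post_var              # parameters of p(w | D)
```

For NNs with many weights and nonlinearities, no such conjugate structure exists, which is why the approximation methods of Sect. 3 are needed.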

Fig. 1. Illustration of a BNN with a single hidden layer (green) compared to a classical NN. The weights are represented as connections between neurons. The weights of the BNN are associated with a probability distribution to model the uncertainty in the parameter choice, while the weights of the NN are deterministic quantities. (Source: own illustration)

3 Overview of Methods

3.1 The Dropout Method

Despite the high performance of deep NNs, overfitting to the training data is a major challenge in many use cases. Hinton et al. [2] and Srivastava et al. [3] introduced the dropout method to reduce the negative impact of overfitting in deep NNs. Subsequently, many papers have been published dealing with the functionality and theoretical understanding of the dropout method. For example, Baldi and Sadowski [4] proposed to interpret dropout as an \(l_2\) regularizer in the training process of deep NNs. Damianou and Lawrence [5], among others, proved that a deep NN with dropout layers in front of each hidden layer is mathematically approximately equivalent to a Gaussian process. Based on this, Gal and Ghahramani [6] went one step further and proved that, when dropout is used, training the deep NN can be viewed as minimizing the Kullback-Leibler divergence (KL divergence) between the approximate probability distribution of the deep NN and the posterior distribution of the underlying Gaussian process. By using dropout before each hidden layer, regardless of the type of hidden layer (fully connected, convolutional, or recurrent), the deep NN can thus be considered a BNN with uncertainty quantification.

The application of the dropout method is very simple: first, a dropout layer with an appropriate dropout rate is added before each hidden layer, regardless of whether it is the first layer after the input layer or the last layer before the output layer. In addition, a regularizer must be chosen for the dropout layers. The authors recommend \(l_2\) regularization if the goal is to have the uncertainty increase far from the data.

For normal deep NNs, the dropout layers are only active during training and are switched off in the inference phase. In the case of BNNs, the dropout layers remain active during the inference phase in order to provide an estimate of the probability distribution instead of a point estimate. This means that even with exactly the same input, the deep NN can make different predictions, because the structure of the network differs slightly for each inference pass due to the dropout layers. Therefore, by performing inference n times, we obtain n different output values for each input. Gal and Ghahramani [6] proved that the mean and standard deviation of these n output values are approximately equal to the mean and standard deviation of the posterior Gaussian distribution of the underlying Gaussian process for the given input. In practice, the mean is taken as the final prediction for the given input, and the standard deviation quantifies the uncertainty of this prediction. Since the estimation of the probability distribution is based on repeating the inference process multiple times, the method is also called Monte Carlo Dropout (MC Dropout).
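A minimal MC Dropout sketch in Python/PyTorch could look as follows; the layer sizes and the dropout rate are illustrative assumptions, and the \(l_2\) regularization recommended above would correspond to the weight_decay argument of the PyTorch optimizer.

```python
import torch
import torch.nn as nn

class MCDropoutNet(nn.Module):
    def __init__(self, in_dim=1, hidden=64, out_dim=1, p=0.1):
        super().__init__()
        # A dropout layer precedes every weight layer, as required by [6]
        self.net = nn.Sequential(
            nn.Dropout(p), nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Dropout(p), nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

@torch.no_grad()
def mc_dropout_predict(model, x, n=100):
    model.train()  # keep the dropout layers active during inference
    outputs = torch.stack([model(x) for _ in range(n)])
    return outputs.mean(0), outputs.std(0)  # prediction and its uncertainty
```

Calling mc_dropout_predict with, e.g., n = 100 yields exactly the MC Dropout mean and standard deviation described above.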

A disadvantage of MC Dropout is that it introduces new hyperparameters, for example, the dropout rate of the dropout layers. To address this issue, Gal et al. published a new method called Concrete Dropout [7] that allows the dropout rate to be learned automatically and enables deep NNs to dynamically adjust their uncertainty quantification as more data are observed. This variant saves the user the time needed for fine-tuning the dropout rate. However, training and inference with this method require more resources than a normal dropout-based BNN.

3.2 Ensembles

As mentioned in the previous section, in MC Dropout the prediction is aggregated over multiple repetitions of the network’s inference process with the same input, where in each pass the network structure is slightly changed due to the active dropout layers. In this regard, MC Dropout can also be interpreted as an ensemble of multiple deep NNs [3], where the individual NNs differ due to the dropout layers but still share most of their parameters. Lakshminarayanan et al. propose in [8] to use an ensemble of several differently initialized NNs with the same structure directly as an approximation of the Bayesian inference model instead of MC Dropout. The advantage is that additional hyperparameters, such as the dropout rates of the individual dropout layers, are avoided. This interpretation motivated the authors to investigate ensembles under the name Deep Ensemble as an alternative approach for uncertainty quantification of deep NNs.

Fig. 2. Difference between MC Dropout and Deep Ensemble. (Source: own illustration)

In Fig. 2 we show the difference between MC Dropout and Deep Ensemble. In the case of Deep Ensemble, all networks in the ensemble have the same structure but are initialized randomly and differently for training, and the data points in the dataset are randomly reshuffled for training each network.
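A minimal sketch of this training and prediction scheme might look as follows; make_model and train_fn are assumed user-supplied helpers that build one ensemble member and train it, respectively.

```python
import torch

def train_ensemble(make_model, train_fn, x_train, y_train, m=5):
    ensemble = []
    for seed in range(m):
        torch.manual_seed(seed)               # different random initialization
        perm = torch.randperm(len(x_train))   # reshuffle the data per member
        model = make_model()
        train_fn(model, x_train[perm], y_train[perm])
        ensemble.append(model)
    return ensemble

@torch.no_grad()
def ensemble_predict(ensemble, x):
    outputs = torch.stack([model(x) for model in ensemble])
    return outputs.mean(0), outputs.std(0)    # mean prediction and uncertainty
```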

Based on the Deep Ensemble method of Lakshminarayanan et al., Pearce et al. proposed a new method called Anchored Ensembling in [9]. Compared to Deep Ensemble, Anchored Ensembling regularizes the parameters of the deep NN with assumed prior probability distributions. The authors report better performance and more accurate probability estimation. However, the prior probability distribution must be carefully chosen. In our experiments, the performance of this method has been shown to be sensitive to the selected hyperparameters.
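The core of Anchored Ensembling is a modified loss in which each ensemble member is additionally pulled toward its own randomly drawn anchor weights. A hedged Python sketch of such a loss (proportional to the formulation in [9]; prior_var and noise_var are assumed hyperparameters):

```python
import torch

# Anchored regularization sketch: `anchors` holds a copy of the member's
# initial weights, drawn from the assumed prior N(0, prior_var).
def anchored_loss(model, anchors, x, y, prior_var=1.0, noise_var=0.1):
    mse = torch.nn.functional.mse_loss(model(x), y, reduction="sum")
    reg = sum(((p - a) ** 2).sum() for p, a in zip(model.parameters(), anchors))
    return mse / noise_var + reg / prior_var
```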

3.3 Variational Inference

In addition to the previously presented simple methods for approximate Bayesian inference, Variational Inference (VI) attempts to formulate and solve the Bayesian inference problem as an optimization problem. In this section, we will first explain the mathematical principles behind this method and then introduce some useful tools for the application of VI.

As explained in Sect. 2.2, a common problem with BNNs is that the evidence \(p(\textbf{Y}|\textbf{X})\) is difficult or even impossible to calculate exactly. To avoid the intractability of the evidence in Eq. (1), VI adopts a simpler surrogate function \(q(\theta )\) to approximate the true posterior distribution \(p(\theta |\mathcal {D})\). The similarity between the two probability distributions is measured by the KL divergence

$$\begin{aligned} \textrm{KL}(q(\theta ) \Vert p(\theta |\mathcal {D}))=\int {q(\theta )}\left[ \log \frac{q(\theta )}{p(\theta |\mathcal {D})}\right] {\text {d}}\!\,\theta ~. \end{aligned}$$
(2)

Here, the optimal surrogate function for the approximation is exactly the function that minimizes the KL divergence, i.e.,

$$\begin{aligned} q^{*}(\theta )=\mathrm {arg\, min}\,\textrm{KL}(q(\theta ) \Vert p(\theta |\mathcal {D}))~. \end{aligned}$$
(3)

Bishop et al. [29] proved that minimizing the KL divergence is equivalent to maximizing the evidence lower bound loss function (ELBO)

$$\begin{aligned} \textrm{ELBO}(q(\theta ))&= \int q(\theta ) [\log p(\textbf{Y}, \theta |\textbf{X}) -\log q(\theta )]{\text {d}}\!\,\theta \nonumber \\&= \int q(\theta )\log p(\textbf{Y}|\textbf{X}, \theta ){\text {d}}\!\,\theta -\textrm{KL}(q(\theta ) \Vert p(\theta ))~. \end{aligned}$$
(4)

The right-hand side of Eq. (4) equals \(\log p(\textbf{Y}| \textbf{X}) - \textrm{KL}(q(\theta ) \Vert p(\theta |\mathcal {D}))\). Given that the KL divergence is always non-negative, it follows that \(\textrm{ELBO}(q(\theta ))\) is always less than or equal to the log-evidence \(\log p(\textbf{Y}| \textbf{X})\), which explains the name of this loss function. Since the evidence is constant and independent of \(\theta \), maximizing the ELBO automatically minimizes the KL divergence. Consequently, the optimization problem in Eq. (3) can be rewritten as

$$\begin{aligned} q^{*}(\theta )&= \mathrm {arg\, min}\,\textrm{KL}(q(\theta ) \Vert p(\theta |\mathcal {D})) \nonumber \\&= \mathrm {arg\, max} \int q(\theta )\log p(\textbf{Y}|\textbf{X}, \theta ){\text {d}}\!\,\theta -\textrm{KL}(q(\theta ) \Vert p(\theta ))~. \end{aligned}$$
(5)

Eq. (5) shows that the intractable evidence \(p(\textbf{Y}|\textbf{X})\) is no longer required, as the posterior distribution \(p(\theta | \mathcal D)\) is avoided. Instead, we only need the known prior distribution \(p(\theta )\) and the likelihood \(p(\textbf{Y}|\textbf{X}, \theta )\), which are easier to handle. In this way, the inference problem is simplified into a tractable optimization problem.

If we examine Eq. (5) again, we notice that the first term corresponds to a maximum likelihood estimator, while the second term incorporates the prior knowledge into the ELBO loss function as a regularizer; in this respect, the overall objective resembles a maximum a posteriori estimator.

Several open-source libraries for Bayesian inference and probabilistic modeling in machine learning have already been developed. Table 1 compares and summarizes the most popular probabilistic programming tools. In our work and research, we usually use Pyro and TensorFlow Probability (TFP), depending on the deep learning framework we choose. These two libraries are preferred because they are under active development and are associated with well-known frameworks for deep NNs.

Table 1. Current open-source libraries for Bayesian inference and probabilistic modeling. All libraries support VI and MCMC methods.
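As a concrete illustration of VI in practice, the following sketch trains a small BNN for regression with Pyro’s stochastic VI; the architecture, the \(\mathcal {N}(0, 1)\) priors, the fixed observation noise of 0.1, and the tensors x_train, y_train, and x_test are all illustrative assumptions.

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.nn import PyroModule, PyroSample
from pyro.infer import SVI, Trace_ELBO, Predictive
from pyro.infer.autoguide import AutoDiagonalNormal

# One-hidden-layer BNN: every weight and bias carries a N(0, 1) prior
class BNN(PyroModule):
    def __init__(self, in_dim=1, hidden=16, out_dim=1):
        super().__init__()
        self.fc1 = PyroModule[torch.nn.Linear](in_dim, hidden)
        self.fc1.weight = PyroSample(dist.Normal(0., 1.).expand([hidden, in_dim]).to_event(2))
        self.fc1.bias = PyroSample(dist.Normal(0., 1.).expand([hidden]).to_event(1))
        self.fc2 = PyroModule[torch.nn.Linear](hidden, out_dim)
        self.fc2.weight = PyroSample(dist.Normal(0., 1.).expand([out_dim, hidden]).to_event(2))
        self.fc2.bias = PyroSample(dist.Normal(0., 1.).expand([out_dim]).to_event(1))

    def forward(self, x, y=None):
        mean = self.fc2(torch.tanh(self.fc1(x))).squeeze(-1)
        with pyro.plate("data", x.shape[0]):
            # likelihood p(Y|X, theta); 0.1 is an assumed noise level
            pyro.sample("obs", dist.Normal(mean, 0.1), obs=y)
        return mean

model = BNN()
guide = AutoDiagonalNormal(model)   # Gaussian surrogate q(theta)
svi = SVI(model, guide, pyro.optim.Adam({"lr": 1e-2}), loss=Trace_ELBO())
for step in range(2000):
    svi.step(x_train, y_train)      # maximizes the ELBO of Eq. (4)

# Predictive distribution: sample weights from q(theta) and average
predictive = Predictive(model, guide=guide, num_samples=100)
y_samples = predictive(x_test)["obs"]
```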

3.4 Laplace Approximation

The Laplace approximation is based on a relatively simple idea [10]: often, only the region around the maximum of the posterior distribution is of interest. Provided that this maximum is known, the distribution in its neighborhood can be approximated via a Taylor-series expansion. The maximum \(\theta _\text {MAP}\) can be determined by classical training via gradient descent. For the Taylor-series expansion up to second order we obtain

$$\begin{aligned} \log p(\theta | \mathcal {D}) \approx \log p(\theta _\text {MAP}|\mathcal {D}) - \frac{1}{2} (\theta - \theta _\text {MAP})^\top \textbf{H}\, (\theta - \theta _\text {MAP}) \end{aligned}$$
(6)

with the Hessian matrix \(\textbf{H}\) of the negative log-posterior evaluated at \(\theta _\text {MAP}\). The first-order term of the Taylor-series expansion vanishes because we expand around a maximum. The Taylor-series expansion around \(\theta _\text {MAP}\) thus results in a normal distribution of the form

$$\begin{aligned} p(\theta | \mathcal {D}) \approx p(\theta _\text {MAP}|\mathcal {D}) \cdot \exp \left( - \frac{1}{2} (\theta - \theta _\text {MAP})^\top \textbf{H}\, (\theta - \theta _\text {MAP}) \right) \end{aligned}$$
(7)

where the Hessian matrix is the inverse of the covariance matrix. However, this approximation of \(p(\theta |\mathcal {D})\) is not normalized and thus not yet a valid probability density. With the normalization factor of a multivariate normal distribution we obtain

$$\begin{aligned} p(\theta | \mathcal {D}) \approx \frac{1}{\sqrt{|2 \pi \textbf{H}^{-1}|}} \exp \left( - \frac{1}{2} (\theta - \theta _\text {MAP})^\top \textbf{H}\, (\theta - \theta _\text {MAP}) \right) ~. \end{aligned}$$
(8)

A limiting factor is the calculation of the Hessian matrix, since it can quickly become very large for a huge number of parameters. In practice, therefore, further approximation methods are often used to calculate \(\textbf{H}\). This allows the method to be used even for large NNs. An advantage of the Laplace approximation is that it can also be applied to already trained NNs to add an uncertainty component in a post-hoc fashion. Since the method approximates the true distribution only locally, the approximated distribution can in principle deviate strongly from the true one. In most cases, however, especially for large data sets, satisfactory results can be achieved with the Laplace approximation.
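One widespread simplification of this kind is to keep only the diagonal of \(\textbf{H}\) and to approximate it by the empirical Fisher information, i.e., accumulated squared gradients. A hedged post-hoc sketch for a trained PyTorch model, where loss_fn, data_loader, and the prior precision of 1.0 are assumptions:

```python
import torch

def diagonal_laplace_std(model, loss_fn, data_loader, prior_prec=1.0):
    # Empirical-Fisher approximation of the Hessian diagonal: accumulate
    # squared gradients of the loss over the training data.
    params = [p for p in model.parameters() if p.requires_grad]
    h_diag = [torch.zeros_like(p) for p in params]
    for x, y in data_loader:
        model.zero_grad()
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, params)
        for h, g in zip(h_diag, grads):
            h += g ** 2
    # The covariance is H^{-1}; with a diagonal H, the per-weight posterior
    # standard deviation is 1/sqrt(h + prior precision).
    return [1.0 / torch.sqrt(h + prior_prec) for h in h_diag]
```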

3.5 Kalman Filter-Based Approaches

While VI treats the Bayesian inference problem as an optimization problem using the ELBO loss function, many other researchers treat it as a filtering problem using Kalman filters. Classical Kalman filters are applicable only to linear systems; however, many variants extend them quite well to nonlinear systems. Singhal and Wu presented the first algorithm that uses the extended Kalman filter to train BNNs in 1989 [11]. Compared to the usual gradient-based and batch-based methods, such Kalman filter-based methods proved to be much more effective than standard backpropagation in terms of the number of training epochs [12]. Watanabe and Tzafestas proposed a different approach in [13], in which the weights of the network are assumed to be Gaussian distributed and the mean and variance of each weight are updated using an extended Kalman filter; however, this requires local linearization for updating the neurons in the hidden layers. This method was extended by Puskorius and Feldkamp [14] to allow for layer-wise or even network-wide correlated neurons.

To avoid linearization, Huber proposes the so-called Bayesian Perceptron in [15]. Even though it is restricted to a single neuron, this work proves that a closed-form computation of the mean and covariance matrix of the posterior weight distribution is possible, provided that the weights are assumed to be Gaussian distributed. Based on the Bayesian Perceptron, Wagner et al. [16] extended this method from a single neuron to the multilayer perceptron (MLP) and called it the Kalman Bayesian Neural Network (KBNN). In this work, a closed-form forward and backward propagation of the weight distributions in each layer is introduced. The method shows advantages in terms of online learning capability and learning efficiency compared to other popular BNN methods such as VI and MCMC. Figure 3 shows a comparison between different BNN methods; among them, KBNN has the best learning efficiency. Chen et al. introduced another method in 2021 [17], which uses the ensemble Kalman filter to handle the measurement noise in the data and to account for it in the uncertainty quantification.

Fig. 3. Comparison between a Gaussian process (GP), stochastic VI, an MCMC method, and the KBNN on a synthetic classification dataset. The first row shows the predictions for the binary classes; the second row shows the uncertainty quantification of the predictions in the data space. The three BNN variants show similarly increased uncertainty in the transition region between the classes. (Source: own illustration)

3.6 Markov Chain Monte Carlo

Another popular and well-researched approach for Bayesian inference to learn BNNs is the Markov Chain Monte Carlo (MCMC) method. A Markov chain describes a process in which the next state depends only on the current state. Under certain assumptions and starting from an initial state, a Markov chain always converges to a stationary distribution. Once this distribution is reached, all further states are distributed according to it. However, it is generally not known a priori how many steps along the Markov chain are necessary until the stationary distribution is reached.

In the context of Bayesian inference, Monte Carlo methods without Markov chains form a broad class of sampling algorithms that use repeated random sampling to generate samples from complex posterior distributions. Rejection sampling [23] is the basic Monte Carlo method for generating samples from a given distribution; while it yields independent samples, it becomes very inefficient in high dimensions because most candidate samples are rejected. Combining such sampling techniques with algorithms that construct Markov chains for the desired probability distribution (e.g., the posterior distribution of the weights in BNNs) yields MCMC methods whose stationary distribution is proportional to the desired posterior distribution. Hence, samples from the stationary distribution represent an approximation of the posterior distribution, from which characteristic parameters such as the mean or the variance can be calculated.
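As an illustration of the basic construction, a random-walk Metropolis-Hastings sketch is given below; log_post is an assumed callable that returns the unnormalized log-posterior, which suffices because the evidence cancels in the acceptance ratio.

```python
import numpy as np

def metropolis_hastings(log_post, theta0, n_steps=5000, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta, samples = np.asarray(theta0, dtype=float), []
    lp = log_post(theta)
    for _ in range(n_steps):
        proposal = theta + step * rng.standard_normal(theta.shape)
        lp_new = log_post(proposal)
        # accept with probability min(1, p(proposal) / p(theta))
        if np.log(rng.uniform()) < lp_new - lp:
            theta, lp = proposal, lp_new
        samples.append(theta.copy())
    return np.array(samples)  # approximate draws from the posterior
```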

One of the earliest MCMC methods is the Metropolis-Hastings algorithm [24, 25]. Many improvements have been proposed since, such as Gibbs sampling [26] and hybrid or Hamiltonian Monte Carlo (HMC) [27]. A major extension of HMC is the No-U-Turn Sampler (NUTS) [28], which usually works much more efficiently. The libraries listed in Table 1 support MCMC methods as well as VI.
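For practical use, the following minimal sketch draws posterior weight samples with NUTS in Pyro, reusing the BNN model and the assumed training tensors from the VI example in Sect. 3.3.

```python
from pyro.infer import MCMC, NUTS

# BNN, x_train, and y_train are the assumed model and data from the VI sketch
nuts_kernel = NUTS(BNN())
mcmc = MCMC(nuts_kernel, num_samples=500, warmup_steps=200)
mcmc.run(x_train, y_train)              # draws weight samples from p(theta|D)
posterior_samples = mcmc.get_samples()  # dict: site name -> sample tensor
```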

4 Conclusion

In this paper, we have presented BNNs as an approach to achieve uncertainty quantification in the field of deep learning. Similar to other Bayesian inference problems, BNNs assume a prior probability distribution of the weights of the NN and attempt to learn a posterior distribution of these weights using the available data.

Several methods have been proposed in the literature to realize the training and inference of BNNs as efficiently as possible. Dropout and ensemble methods are among the simpler variants, where the learning and inference of BNNs is realized by aggregation over an ensemble of different networks; the weights themselves, however, are considered deterministic. In the other, “real” Bayesian methods, the weights of the NN are modeled as random variables. Variational inference formulates the learning task as an optimization problem, which is solved approximately using the ELBO loss function. In comparison, MCMC algorithms use Markov chains to draw samples from the posterior distribution of the weights. While the Laplace approximation utilizes a local Taylor-series expansion to equip already trained NNs with uncertainty quantification, Kalman filter-based methods formulate the calculation of the posterior distribution of the weights as a filtering problem in which the weights are updated recursively.

Compared to conventional NNs, BNNs offer advantages in active learning, causal inference, out-of-distribution detection, and security-related use cases thanks to their capability for uncertainty quantification. In the future, we will focus on the scalability of the different methods and on applications of BNNs in novel fields such as reinforcement learning and the verification of AI. We believe that uncertainty quantification will be crucial for increasing the reliability and explainability of AI systems, which are key elements of digital sovereignty in terms of robustness and trustworthiness.