1 Introduction

In classical machine learning, predictions are usually expressed as point estimates: the learned model simply returns a single value to the user without indicating how confident it is in that prediction. In many use cases, this is acceptable. For example, for movie recommendations on streaming platforms or book recommendations based on books already read, a single wrong or poor decision usually has no severe consequences. The situation is different, however, in critical application areas of artificial intelligence (AI). In medicine, autonomous driving, or quality testing in industrial production, the financial risks, but especially the impact on humans, are significantly greater. Here, a desirable behavior of the AI would be to provide a confidence level for its decision or, in the case of very uncertain decisions, to signal to the user: “I don’t know” or “I am uncertain”.

Developing new methods with this functionality and extending existing methods with a probability component is the endeavor of the research field of probabilistic machine learning. In recent years, the focus has been particularly on deep neural networks (NNs), as they have become the gold standard for many problems, especially in supervised learning. Quantifying the uncertainty of a machine learning model allows the user to assess the quality of a prediction and whether its confidence is sufficient for the particular use case. Thus, the user is not left alone with the algorithm’s decision, but is provided with additional information to help them evaluate this decision and take further action if necessary. Such actions can include, for example, retraining with new data if the learning algorithm exhibits high uncertainty in general, or deferring a specific decision to human review if the algorithm’s decision seems too risky. Uncertainty quantification of machine learning algorithms, and of NNs in particular, can thus make an important contribution to digital sovereignty.

The goal of this work is to motivate the quantification of uncertainty for NNs and to present different methods that make this possible in practice. To this end, the different types of uncertainty and the need for approximate methods are first discussed. Afterwards, the Bayesian NN (BNN) is introduced, which is a principled way to realize probabilistic machine learning. Sects. 3.1–3.6 deal with methods for approximately calculating the posterior probabilities of BNNs. Finally, we discuss the future of the research field and how it can contribute to digital sovereignty.

2 Probabilistic Machine Learning

2.1 Basic Principles

In supervised learning, we always consider a dataset \(\mathcal {D} = \{ (\textbf{x}_i, \textbf{y}_i) \}_{i=1}^N\) with N inputs \(\textbf{x}\) and outputs \(\textbf{y}\). While in classical machine learning a function f with parameters \(\theta \) is to be learned with \(f_\theta (\textbf{x}_i) = \textbf{y}_i\), in probabilistic machine learning the output or predictive probability distribution \(p(\textbf{y}|\textbf{x}, \mathcal {D})\) is to be learned. \(p(\textbf{y}|\textbf{x}, \mathcal {D})\) is a conditional probability distribution that gives a probability of output \(\textbf{y}\) based on the training data \(\mathcal {D}\) and the current data point \(\textbf{x}\). In order to calculate \(p(\textbf{y}|\textbf{x}, \mathcal {D})\), the parameters of the machine learning model used are also assumed to be probabilistic. By integrating over the probability distribution of the network parameters, \(p(\textbf{y}|\textbf{x}, \mathcal {D}) = \int p(\textbf{y}|\textbf{x},\theta )\cdot p(\theta |\mathcal {D}){\text {d}}\!\,\theta \) can be used to compute the probability distribution of the output \(\textbf{y}\).

Intuitively, the parameter \(\theta \) indexes all possible realizations of an NN via the probability distribution \(p(\textbf{y}|\textbf{x},\theta )\). The outputs of all possible NNs are averaged, with each NN weighted by \(p(\theta |\mathcal {D})\), the probability of a particular parameter realization given the training data. Thus, the flexibility and variability in the choice of model parameters are taken into account when calculating the output probability distribution.
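Since this integral is rarely tractable, it is typically approximated by Monte Carlo sampling over the parameters. The following minimal Python sketch illustrates the idea; the callable sample_model, which returns a network whose parameters are drawn from (an approximation of) \(p(\theta |\mathcal {D})\), is a hypothetical placeholder.

```python
import torch

def predictive_distribution(x, sample_model, n_samples=100):
    # Monte Carlo approximation of p(y|x, D): average the outputs of
    # networks whose parameters are drawn from p(theta|D).
    outputs = torch.stack([sample_model()(x) for _ in range(n_samples)])
    mean = outputs.mean(dim=0)  # final prediction
    std = outputs.std(dim=0)    # spread reflects (epistemic) uncertainty
    return mean, std
```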

In principle, two types of uncertainty must be distinguished in this context. Aleatoric uncertainty describes the intrinsic uncertainty within the data, often also referred to as measurement noise. Epistemic uncertainty describes the lack of model knowledge, which in our case is reflected in the probability distribution over the model parameters. For example, based on the available data, different parameter configurations can lead to very similar prediction results. The problem, however, is that the distribution \(p(\theta |\mathcal {D})\) can usually only be calculated approximately: for a large number of parameters \(\theta \), a high-dimensional integration arises that can be solved exactly only in exceptional cases. Often, it is therefore assumed that all relevant quantities are normally distributed in order to simplify the problem.

2.2 Bayesian Neural Networks

If NNs are extended by a probability distribution over their weights, they are called BNNs (see Fig. 1). The concept of BNNs has existed for several decades [1], but has gained renewed attention in recent years due to the popularity of deep NNs. In general, a BNN is characterized not only by model parameters \(\theta \), i.e., the weights, but also by a probability distribution \(p(\theta )\) over all weights. After successful training, this distribution depends on the learned data \(\mathcal {D}\) and becomes the posterior distribution \(p(\theta |\mathcal {D})\). During the training process, the famous Bayes’ rule is used to process the information about the training data in order to adjust the distribution \(p(\theta )\). Bayes’ rule can be formulated for a BNN as

$$\begin{aligned} p(\theta |\mathcal {D}) = \frac{p(\textbf{Y}|\textbf{X}, \theta )\cdot p(\theta )}{p(\textbf{Y}|\textbf{X})} \end{aligned}$$
(1)

with the training data \(\textbf{X} \triangleq [\textbf{x}_1\, \ldots \, \textbf{x}_N]\), \(\textbf{Y} \triangleq [\textbf{y}_1\, \ldots \, \textbf{y}_N]\). Here \(p(\theta )\) denotes the prior distribution, which is assumed as the initial distribution of the weights. It can either encode information and prior knowledge about the problem or be chosen as general and uninformative as possible. \(p(\textbf{Y}|\textbf{X}, \theta )\) is called the likelihood. It describes how well the training data can be modeled with the given model parameters \(\theta \). \(p(\textbf{Y}|\textbf{X})\) is the so-called evidence. It serves as a normalization factor and describes the overall probability of the training data independent of the parameter choice. The distribution over the model parameters is adjusted using Eq. (1) as new data becomes available. In general, \(p(\textbf{Y}|\textbf{X})\) cannot be calculated exactly, which is why Eq. (1) can only be solved approximately in practice. The following section deals with different methods for the approximate calculation of the posterior distribution of the parameters of a BNN based on the available data.
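For intuition, Eq. (1) can be evaluated in closed form in simple conjugate cases. The following toy sketch, for a single weight w with prior \(\mathcal {N}(0, 1)\) and a linear-Gaussian likelihood \(y_i \sim \mathcal {N}(w x_i, \sigma ^2)\), shows such an exact update; all values are illustrative.

```python
import numpy as np

# Exact Bayes update (Eq. (1)) for a toy model: a single weight w with
# prior N(0, 1) and likelihood y_i ~ N(w * x_i, sigma^2). Conjugacy makes
# the posterior Gaussian, so no approximation is needed here.
def posterior_weight(x, y, sigma=0.5):
    prior_prec = 1.0                        # precision of the prior N(0, 1)
    lik_prec = np.dot(x, x) / sigma**2      # precision contributed by the data
    post_var = 1.0 / (prior_prec + lik_prec)
    post_mean = post_var * np.dot(x, y) / sigma**2
    return post_mean, post_var              # parameters of p(w | D)
```

For NNs with many weights and nonlinearities, no such conjugate structure exists, which is why the approximation methods of Sect. 3 are needed.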

Fig. 1. Illustration of a BNN with a single hidden layer (green) compared to a classical NN. The weights are represented as connections between neurons. The weights of the BNN are associated with a probability distribution to model the uncertainty in the parameter choice, while the weights of the NN are deterministic quantities. (Source: own illustration)

3 Overview of Methods

3.1 The Dropout Method

Despite the high performance of deep NNs, overfitting to the training data is a major challenge in many use cases. Hinton et al. [2] and Srivastava et al. [3] introduced the dropout method to reduce the negative impact of overfitting in deep NNs. Subsequently, many papers have been published dealing with the functionality and theoretical understanding of the dropout method. For example, Baldi and Sadowski [4] proposed to interpret dropout as an \(l_2\) regularizer in the training process of deep NNs. Damianou and Lawrence [5], among others, proved that a deep NN with dropout layers in front of each hidden layer is mathematically approximately equivalent to a Gaussian process. Based on this, Gal and Ghahramani [6] went one step further and proved that, when dropout is used, training the deep NN can be viewed as minimizing the Kullback-Leibler divergence (KL divergence) between the approximate probability distribution of the deep NN and the posterior distribution of the underlying Gaussian process. By using dropout before each hidden layer, regardless of the type of hidden layer (fully connected, convolutional, or recurrent), the deep NN can thus be considered a BNN with uncertainty quantification.

The application of the dropout method is very simple: first, a dropout layer with an appropriate dropout rate is added before each hidden layer, regardless of whether it is the first layer after the input layer or the last layer before the output layer. In addition, a regularizer must be chosen for the dropout layers. The authors recommend \(l_2\) regularization if the goal is to have the uncertainty increase far from the data.

For normal deep NNs, the dropout layers are only active during training and are switched off in the inference phase. In the case of BNNs, the dropout layers remain active during the inference phase in order to provide an estimate of the probability distribution instead of a point estimate. This means that even with exactly the same input, the deep NN can make different predictions, because the structure of the network differs slightly for each inference pass due to the dropout layers. Therefore, by performing inference n times, we obtain n different output values for each input. Gal and Ghahramani [6] proved that the mean and standard deviation of these n output values are approximately equal to the mean and standard deviation of the posterior Gaussian distribution of the underlying Gaussian process for the given input. In practice, the mean is taken as the final prediction for the given input, and the standard deviation quantifies the uncertainty of this prediction. Since the estimation of the probability distribution is based on repeating the inference process multiple times, the method is also called Monte Carlo Dropout (MC Dropout).
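A minimal MC Dropout sketch in Python/PyTorch could look as follows; the layer sizes and the dropout rate are illustrative assumptions, and the \(l_2\) regularization recommended above would correspond to the weight_decay argument of the PyTorch optimizer.

```python
import torch
import torch.nn as nn

class MCDropoutNet(nn.Module):
    def __init__(self, in_dim=1, hidden=64, out_dim=1, p=0.1):
        super().__init__()
        # A dropout layer precedes every weight layer, as required by [6]
        self.net = nn.Sequential(
            nn.Dropout(p), nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Dropout(p), nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

@torch.no_grad()
def mc_dropout_predict(model, x, n=100):
    model.train()  # keep the dropout layers active during inference
    outputs = torch.stack([model(x) for _ in range(n)])
    return outputs.mean(0), outputs.std(0)  # prediction and its uncertainty
```

Calling mc_dropout_predict with, e.g., n = 100 yields exactly the MC Dropout mean and standard deviation described above.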

A disadvantage of MC Dropout is that it introduces new hyperparameters, for example, the dropout rate of the dropout layers. To address this issue, Gal et al. published a new method called Concrete Dropout [7] that allows the dropout rate to be learned automatically and enables deep NNs to dynamically adjust their uncertainty quantification as more data are observed. This variant saves the user the time needed for fine-tuning the dropout rate. However, training and inference with this method require more resources than a normal dropout-based BNN.

3.2 Ensembles

As mentioned in the previous section, in MC Dropout the prediction is aggregated over multiple repetitions of the network’s inference process with the same input, where in each pass the network structure is slightly changed due to the active dropout layers. In this regard, MC Dropout can also be interpreted as an ensemble of multiple deep NNs [3], where the individual NNs differ due to the dropout layers but still share most of their parameters. Lakshminarayanan et al. propose in [8] to use an ensemble of several differently initialized NNs with the same structure directly as an approximation of the Bayesian inference model instead of MC Dropout. The advantage is that additional hyperparameters, such as the dropout rates of the individual dropout layers, are avoided. This interpretation motivated the authors to investigate ensembles under the name Deep Ensemble as an alternative approach for uncertainty quantification of deep NNs.

Fig. 2. Difference between MC Dropout and Deep Ensemble. (Source: own illustration)

In Fig. 2 we show the difference between MC Dropout and Deep Ensemble. In the case of Deep Ensemble, all networks in the ensemble have the same structure but are initialized randomly and differently for training, and the data points in the dataset are randomly reshuffled for training each network.
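A minimal sketch of this training and prediction scheme might look as follows; make_model and train_fn are assumed user-supplied helpers that build one ensemble member and train it, respectively.

```python
import torch

def train_ensemble(make_model, train_fn, x_train, y_train, m=5):
    ensemble = []
    for seed in range(m):
        torch.manual_seed(seed)               # different random initialization
        perm = torch.randperm(len(x_train))   # reshuffle the data per member
        model = make_model()
        train_fn(model, x_train[perm], y_train[perm])
        ensemble.append(model)
    return ensemble

@torch.no_grad()
def ensemble_predict(ensemble, x):
    outputs = torch.stack([model(x) for model in ensemble])
    return outputs.mean(0), outputs.std(0)    # mean prediction and uncertainty
```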

Based on the Deep Ensemble method of Lakshminarayanan et al., Pearce et al. proposed a new method called Anchored Ensembling in [9]. Compared to Deep Ensemble, Anchored Ensembling regularizes the parameters of the deep NN with assumed prior probability distributions. The authors report better performance and more accurate probability estimation. However, the prior probability distribution must be carefully chosen. In our experiments, the performance of this method has been shown to be sensitive to the selected hyperparameters.
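The core of Anchored Ensembling is a modified loss in which each ensemble member is additionally pulled toward its own randomly drawn anchor weights. A hedged Python sketch of such a loss (proportional to the formulation in [9]; prior_var and noise_var are assumed hyperparameters):

```python
import torch

# Anchored regularization sketch: `anchors` holds a copy of the member's
# initial weights, drawn from the assumed prior N(0, prior_var).
def anchored_loss(model, anchors, x, y, prior_var=1.0, noise_var=0.1):
    mse = torch.nn.functional.mse_loss(model(x), y, reduction="sum")
    reg = sum(((p - a) ** 2).sum() for p, a in zip(model.parameters(), anchors))
    return mse / noise_var + reg / prior_var
```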

3.3 Variational Inference

In addition to the previously presented simple methods for approximate Bayesian inference, Variational Inference (VI) attempts to formulate and solve the Bayesian inference problem as an optimization problem. In this section, we will first explain the mathematical principles behind this method and then introduce some useful tools for the application of VI.

As explained in Sect. 2.2, a common problem with BNNs is that the evidence \(p(\textbf{Y}|\textbf{X})\) is difficult or even impossible to calculate exactly. To avoid the intractability of the evidence in Eq. (1), VI adopts a simpler surrogate function \(q(\theta )\) to approximate the true posterior distribution \(p(\theta |\mathcal {D})\). The similarity between the two probability distributions is measured by the KL divergence

$$\begin{aligned} \textrm{KL}(q(\theta ) \Vert p(\theta |\mathcal {D}))=\int {q(\theta )}\left[ \log \frac{q(\theta )}{p(\theta |\mathcal {D})}\right] {\text {d}}\!\,\theta ~. \end{aligned}$$
(2)

Here, the optimal surrogate function for the approximation is exactly the function that minimizes the KL divergence, i.e.,

$$\begin{aligned} q^{*}(\theta )=\mathrm {arg\, min}\,\textrm{KL}(q(\theta ) \Vert p(\theta |\mathcal {D}))~. \end{aligned}$$
(3)

Bishop et al. [29] proved that minimizing the KL divergence is equivalent to maximizing the evidence lower bound loss function (ELBO)

$$\begin{aligned} \textrm{ELBO}(q(\theta ))&= \int q(\theta ) [\log p(\textbf{Y}, \theta |\textbf{X}) -\log q(\theta )]{\text {d}}\!\,\theta \nonumber \\&= \int q(\theta )\log p(\textbf{Y}|\textbf{X}, \theta ){\text {d}}\!\,\theta -\textrm{KL}(q(\theta ) \Vert p(\theta ))~. \end{aligned}$$
(4)

The right-hand side of Eq. (4) equals \(\log p(\textbf{Y}| \textbf{X}) - \textrm{KL}(q(\theta ) \Vert p(\theta |\mathcal {D}))\). Given that the KL divergence is always non-negative, it follows that \(\textrm{ELBO}(q(\theta ))\) is always less than or equal to the log-evidence \(\log p(\textbf{Y}| \textbf{X})\), which explains the name of this loss function. Since the evidence is constant and independent of \(\theta \), maximizing the ELBO automatically minimizes the KL divergence. Consequently, the optimization problem in Eq. (3) can be rewritten as

$$\begin{aligned} q^{*}(\theta )&= \mathrm {arg\, min}\,\textrm{KL}(q(\theta ) \Vert p(\theta |\mathcal {D})) \nonumber \\&= \mathrm {arg\, max} \int q(\theta )\log p(\textbf{Y}|\textbf{X}, \theta ){\text {d}}\!\,\theta -\textrm{KL}(q(\theta ) \Vert p(\theta ))~. \end{aligned}$$
(5)

Eq. (5) shows that the intractable evidence \(p(\textbf{Y}|\textbf{X})\) is no longer required, as the posterior distribution \(p(\theta | \mathcal D)\) is avoided. Instead, we only need the known prior distribution \(p(\theta )\) and the likelihood \(p(\textbf{Y}|\textbf{X}, \theta )\), which are easier to handle. In this way, the inference problem is simplified into a tractable optimization problem.

If we examine Eq. (5) again, we notice that the first term corresponds to a maximum likelihood estimator, while the second term incorporates the prior knowledge into the ELBO loss function as a regularizer; in this respect, the overall objective resembles a maximum a posteriori estimator.

Several open-source libraries for Bayesian inference and probabilistic modeling in machine learning have already been developed. Table 1 compares and summarizes the most popular probabilistic programming tools. In our work and research, we usually use Pyro and TensorFlow Probability (TFP), depending on the deep learning framework we choose. These two libraries are preferred because they are under active development and are associated with well-known frameworks for deep NNs.

Table 1. Current open-source libraries for Bayesian inference and probabilistic modeling. All libraries support VI and MCMC methods.
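As a concrete illustration of VI in practice, the following sketch trains a small BNN for regression with Pyro’s stochastic VI; the architecture, the \(\mathcal {N}(0, 1)\) priors, the fixed observation noise of 0.1, and the tensors x_train, y_train, and x_test are all illustrative assumptions.

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.nn import PyroModule, PyroSample
from pyro.infer import SVI, Trace_ELBO, Predictive
from pyro.infer.autoguide import AutoDiagonalNormal

# One-hidden-layer BNN: every weight and bias carries a N(0, 1) prior
class BNN(PyroModule):
    def __init__(self, in_dim=1, hidden=16, out_dim=1):
        super().__init__()
        self.fc1 = PyroModule[torch.nn.Linear](in_dim, hidden)
        self.fc1.weight = PyroSample(dist.Normal(0., 1.).expand([hidden, in_dim]).to_event(2))
        self.fc1.bias = PyroSample(dist.Normal(0., 1.).expand([hidden]).to_event(1))
        self.fc2 = PyroModule[torch.nn.Linear](hidden, out_dim)
        self.fc2.weight = PyroSample(dist.Normal(0., 1.).expand([out_dim, hidden]).to_event(2))
        self.fc2.bias = PyroSample(dist.Normal(0., 1.).expand([out_dim]).to_event(1))

    def forward(self, x, y=None):
        mean = self.fc2(torch.tanh(self.fc1(x))).squeeze(-1)
        with pyro.plate("data", x.shape[0]):
            # likelihood p(Y|X, theta); 0.1 is an assumed noise level
            pyro.sample("obs", dist.Normal(mean, 0.1), obs=y)
        return mean

model = BNN()
guide = AutoDiagonalNormal(model)   # Gaussian surrogate q(theta)
svi = SVI(model, guide, pyro.optim.Adam({"lr": 1e-2}), loss=Trace_ELBO())
for step in range(2000):
    svi.step(x_train, y_train)      # maximizes the ELBO of Eq. (4)

# Predictive distribution: sample weights from q(theta) and average
predictive = Predictive(model, guide=guide, num_samples=100)
y_samples = predictive(x_test)["obs"]
```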

3.4 Laplace Approximation

The Laplace approximation is based on a relatively simple idea [10]: often, only the region around the maximum of the posterior distribution is of interest. Provided that this maximum is known, the distribution in its neighborhood can be approximated via a Taylor-series expansion. The maximum \(\theta _\text {MAP}\) can be determined by classical training via gradient descent. For the Taylor-series expansion up to second order we obtain

$$\begin{aligned} \log p(\theta | \mathcal {D}) \approx \log p(\theta _\text {MAP}|\mathcal {D}) - \frac{1}{2} (\theta - \theta _\text {MAP})^\top \textbf{H}\, (\theta - \theta _\text {MAP}) \end{aligned}$$
(6)

with the Hessian matrix \(\textbf{H}\) of the negative log-posterior evaluated at \(\theta _\text {MAP}\). The first-order term of the Taylor-series expansion vanishes because we expand around a maximum. The Taylor-series expansion around \(\theta _\text {MAP}\) thus results in a normal distribution of the form

$$\begin{aligned} p(\theta | \mathcal {D}) \approx p(\theta _\text {MAP}|\mathcal {D}) \cdot \exp \left( - \frac{1}{2} (\theta - \theta _\text {MAP})^\top \textbf{H}\, (\theta - \theta _\text {MAP}) \right) \end{aligned}$$
(7)

where the Hessian matrix is the inverse of the covariance matrix. However, this approximation of \(p(\theta |\mathcal {D})\) is not normalized and thus not yet a valid probability density. With the normalization factor of a multivariate normal distribution we obtain

$$\begin{aligned} p(\theta | \mathcal {D}) \approx \frac{1}{\sqrt{|2 \pi \textbf{H}^{-1}|}} \exp \left( - \frac{1}{2} (\theta - \theta _\text {MAP})^\top \textbf{H}\, (\theta - \theta _\text {MAP}) \right) ~. \end{aligned}$$
(8)

A limiting factor is the calculation of the Hessian matrix, since it can quickly become very large for a huge number of parameters. In practice, therefore, further approximation methods are often used to calculate \(\textbf{H}\). This allows the method to be used even for large NNs. An advantage of the Laplace approximation is that it can also be applied to already trained NNs to add an uncertainty component in a post-hoc fashion. Since the method approximates the true distribution only locally, the approximated distribution can in principle deviate strongly from the true one. In most cases, however, especially for large data sets, satisfactory results can be achieved with the Laplace approximation.
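One widespread simplification of this kind is to keep only the diagonal of \(\textbf{H}\) and to approximate it by the empirical Fisher information, i.e., accumulated squared gradients. A hedged post-hoc sketch for a trained PyTorch model, where loss_fn, data_loader, and the prior precision of 1.0 are assumptions:

```python
import torch

def diagonal_laplace_std(model, loss_fn, data_loader, prior_prec=1.0):
    # Empirical-Fisher approximation of the Hessian diagonal: accumulate
    # squared gradients of the loss over the training data.
    params = [p for p in model.parameters() if p.requires_grad]
    h_diag = [torch.zeros_like(p) for p in params]
    for x, y in data_loader:
        model.zero_grad()
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, params)
        for h, g in zip(h_diag, grads):
            h += g ** 2
    # The covariance is H^{-1}; with a diagonal H, the per-weight posterior
    # standard deviation is 1/sqrt(h + prior precision).
    return [1.0 / torch.sqrt(h + prior_prec) for h in h_diag]
```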

3.5 Kalman Filter-Based Approaches

While VI treats the Bayesian inference problem as an optimization problem using the ELBO loss function, many other researchers treat it as a filtering problem using Kalman filters. Classical Kalman filters are applicable only to linear systems; however, many variants extend them quite well to nonlinear systems. Singhal and Wu presented the first algorithm that uses the extended Kalman filter to train BNNs in 1989 [11]. Compared to the usual gradient-based and batch-based methods, such Kalman filter-based methods proved to be much more effective than standard backpropagation in terms of the number of training epochs [12]. Watanabe and Tzafestas proposed a different approach in [13], in which the weights of the network are assumed to be Gaussian distributed and the mean and variance of each weight are updated using an extended Kalman filter; however, this requires local linearization for updating the neurons in the hidden layers. This method was extended by Puskorius and Feldkamp [14] to allow for layer-wise or even network-wide correlated neurons.

To avoid linearization, Huber proposes the so-called Bayesian Perceptron in [15]. Even though it is restricted to a single neuron, this work proves that a closed-form computation of the mean and covariance matrix of the posterior weight distribution is possible, provided that the weights are assumed to be Gaussian distributed. Based on the Bayesian Perceptron, Wagner et al. [16] extended this method from a single neuron to the multilayer perceptron (MLP) and called it the Kalman Bayesian Neural Network (KBNN). In this work, a closed-form forward and backward propagation of the weight distributions in each layer is introduced. The method shows advantages in terms of online learning capability and learning efficiency compared to other popular BNN methods such as VI and MCMC. Figure 3 shows a comparison between different BNN methods; among them, KBNN has the best learning efficiency. Chen et al. introduced another method in 2021 [17], which uses the ensemble Kalman filter to handle the measurement noise in the data and to account for it in the uncertainty quantification.

Fig. 3. Comparison between a Gaussian process (GP), stochastic VI, an MCMC method, and the KBNN on a synthetic classification dataset. The first row shows the predictions for the binary classes; the second row shows the uncertainty quantification of the predictions in the data space. The three BNN variants show similarly increased uncertainty in the transition region between the classes. (Source: own illustration)

3.6 Markov Chain Monte Carlo

Another popular and well-researched approach for Bayesian inference to learn BNNs is the Markov Chain Monte Carlo (MCMC) method. A Markov chain describes a process in which the next state depends only on the current state. Under certain assumptions and starting from an initial state, a Markov chain always converges to a stationary distribution. Once this distribution is reached, all further states are distributed according to it. However, it is generally not known a priori how many steps along the Markov chain are necessary until the stationary distribution is reached.

In the context of Bayesian inference, Monte Carlo methods without Markov chains form a broad class of sampling algorithms that use repeated random sampling to generate samples from complex posterior distributions. Rejection sampling [23] is the basic Monte Carlo method for generating samples from a given distribution; while it yields independent samples, it becomes very inefficient in high dimensions because most candidate samples are rejected. Combining such sampling techniques with algorithms that construct Markov chains for the desired probability distribution (e.g., the posterior distribution of the weights in BNNs) yields MCMC methods whose stationary distribution is proportional to the desired posterior distribution. Hence, samples from the stationary distribution represent an approximation of the posterior distribution, from which characteristic parameters such as the mean or the variance can be calculated.
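As an illustration of the basic construction, a random-walk Metropolis-Hastings sketch is given below; log_post is an assumed callable that returns the unnormalized log-posterior, which suffices because the evidence cancels in the acceptance ratio.

```python
import numpy as np

def metropolis_hastings(log_post, theta0, n_steps=5000, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta, samples = np.asarray(theta0, dtype=float), []
    lp = log_post(theta)
    for _ in range(n_steps):
        proposal = theta + step * rng.standard_normal(theta.shape)
        lp_new = log_post(proposal)
        # accept with probability min(1, p(proposal) / p(theta))
        if np.log(rng.uniform()) < lp_new - lp:
            theta, lp = proposal, lp_new
        samples.append(theta.copy())
    return np.array(samples)  # approximate draws from the posterior
```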

One of the earliest MCMC methods is the Metropolis-Hastings algorithm [24, 25]. Many improvements have been proposed since, such as Gibbs sampling [26] and hybrid or Hamiltonian Monte Carlo (HMC) [27]. A major extension of HMC is the No-U-Turn Sampler (NUTS) [28], which usually works much more efficiently. The libraries listed in Table 1 support MCMC methods as well as VI.
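For practical use, the following minimal sketch draws posterior weight samples with NUTS in Pyro, reusing the BNN model and the assumed training tensors from the VI example in Sect. 3.3.

```python
from pyro.infer import MCMC, NUTS

# BNN, x_train, and y_train are the assumed model and data from the VI sketch
nuts_kernel = NUTS(BNN())
mcmc = MCMC(nuts_kernel, num_samples=500, warmup_steps=200)
mcmc.run(x_train, y_train)              # draws weight samples from p(theta|D)
posterior_samples = mcmc.get_samples()  # dict: site name -> sample tensor
```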

4 Conclusion

In this paper, we have presented BNNs as an approach to achieve uncertainty quantification in the field of deep learning. Similar to other Bayesian inference problems, BNNs assume a prior probability distribution of the weights of the NN and attempt to learn a posterior distribution of these weights using the available data.

Several methods have been proposed in the literature to realize the training and inference of BNNs as efficiently as possible. Dropout and ensemble methods are among the simpler variants, where the learning and inference of BNNs is realized by aggregation over an ensemble of different networks; the weights themselves, however, are considered deterministic. In the other, “real” Bayesian methods, the weights of the NN are modeled as random variables. Variational inference formulates the learning task as an optimization problem, which is solved approximately using the ELBO loss function. In comparison, MCMC algorithms use Markov chains to draw samples from the posterior distribution of the weights. While the Laplace approximation utilizes a local Taylor-series expansion to equip already trained NNs with uncertainty quantification, Kalman filter-based methods formulate the calculation of the posterior distribution of the weights as a filtering problem in which the weights are updated recursively.

Compared to conventional NNs, BNNs offer advantages in active learning, causal inference, out-of-distribution detection, and security-related use cases thanks to their capability for uncertainty quantification. In the future, we will focus on the scalability of the different methods and on applications of BNNs in novel fields such as reinforcement learning and the verification of AI. We believe that uncertainty quantification will be crucial for increasing the reliability and explainability of AI systems, which are key elements of digital sovereignty in terms of robustness and trustworthiness.