1 Introduction

The Bayesian paradigm provides a convenient and attractive framework for performing inference in statistical models, allowing for the incorporation of prior knowledge and, therefore, regularization of the effects of interest. However, beyond simple conjugate cases, the posterior distribution resulting from Bayes’ theorem is in general not analytically tractable. The invention of Markov chain Monte Carlo (MCMC) simulation techniques has revolutionized the applicability of Bayesian inference even in very complex statistical models, providing sampling-based numerical access to the posterior. MCMC provides access to the exact posterior even for small samples, including exact uncertainty quantification also for complex functionals of the original model parameters. On the downside, however, MCMC is notoriously slow due to its sequential construction and requires careful monitoring of mixing and convergence towards the (unknown) stationary distribution, often including the adaptive choice of tuning parameters. Hence, there is renewed interest in approximate approaches to Bayesian inference that bypass the need for MCMC sampling at the cost of only approximate access to the posterior.

One such approach that has gained considerably in popularity, especially in machine learning, is variational inference (VI), also called variational Bayes. The basic idea is to find the optimal approximation to the posterior distribution within a pre-specified class of variational distributions by searching for the parameters of the approximating distribution with a deterministic optimization scheme (Ormerod and Wand 2010; Blei et al. 2017). In contrast to stochastic simulation techniques such as MCMC, the direct optimization of an objective function promises much faster inference. However, depending on the complexity of the approximating family chosen for VI, the approximate posterior may not capture all aspects of the true posterior distribution and, in particular, it has been reported that simple VI approaches may considerably underestimate the uncertainty attached to the parameters of interest (Bishop 2006, Ch. 10). This is particularly the case for the simplest form of VI, mean-field VI (MFVI), where the variational family assumes (blocks of) parameters to be mutually independent. This assumption significantly reduces the complexity of the approximation problem and often enables fast optimization steps resembling the structure of Gibbs updates in MCMC. However, the restrictive assumption of posterior independence is often at odds with the true posterior, such that MFVI provides sensible point estimates but may severely underestimate parameter uncertainty.

As a consequence, various approaches beyond simple MFVI have been suggested (as reviewed, for example, in Zhang et al. 2018). One obvious remedy is to combine as many parameters as possible in one block such that one multivariate variational distribution is constructed, thereby mitigating the mean-field assumption (see for example Hui et al. 2019; Luts et al. 2014). However, this comes at the price of determining a fully unstructured covariance matrix for all parameters simultaneously, which requires handling large matrices, especially for a large number of effects. An alternative is the semi-implicit VI (SIVI) approach recently developed by Yin and Zhou (2018). Compared to MFVI, it increases the complexity of the variational distribution, allowing for some parameter dependencies. Firstly, SIVI uses a hierarchical construction of the variational parameters, based on hierarchical VI (Ranganath et al. 2016), to bring back parameter dependencies. Secondly, the mixing distribution on the higher level of the hierarchy may be an implicit distribution, i.e. it is not required to have an analytic probability density function as long as samples can be generated from it (Diggle and Gratton 1984), which allows for highly flexible choices. While this approach brings simulation back into the inferential procedure, the underlying reasoning relies on law-of-large-numbers asymptotics, which are much easier to control and monitor than the distributional convergence of a Markov chain towards its limiting stationary distribution.

In this paper, we focus on semiparametric additive models as a particularly important special case of statistical modelling where Bayesian inference has gained considerable interest and where both Gibbs sampling (Lang and Brezger 2004) and simple MFVI (Luts and Wand 2015; Waldmann and Kneib 2015; Hui et al. 2019) have been developed. More precisely, we

  • review different forms of VI, including MFVI and SIVI, in their general form,

  • develop their specific forms in semiparametric additive models, including an improved MFVI approach in which all regression coefficients associated with the additive components are combined in one block, following ideas developed in Luts and Wand (2015) and Hui et al. (2019), as well as a combination of SIVI and MFVI (SIMFVI) that leads to more robust results and speeds up the optimization compared to the SIVI approach,

  • investigate the performance of the different forms of VI with a specific focus on quantifying uncertainty in simulations to provide guidance on their reliability and applicability, and

  • apply the methods to tree heights of Douglas fir from a large-scale forestry data set.

We find that SIVI and SIMFVI effectively restore parameter uncertainty such that local and simultaneous credible intervals are accurately represented. However, the improved version of MFVI shows comparable performance such that combining all regression parameters in one block and therefore incorporating across effect dependence seems to be the crucial aspect in constructing an appropriate approximating distribution.

The structure of this article is as follows: In Sect. 2, we briefly introduce the necessary background on Bayesian additive regression models. Section 3 describes the methodology of VI and derives the algorithms for the different forms of MFVI and SIVI both in general and in the context of additive models. In Sect. 4, we compare all introduced methods and the Gibbs sampler in a simulation study with a focus on uncertainty quantification. Section 5 describes an application of the presented methods to tree heights of Douglas fir. In the final section, we summarize our results and briefly discuss limitations and potential directions for future research.

2 Bayesian additive models

We consider Bayesian forms of semiparametric additive models for regression data \((y_i, \varvec{x}_i)\), \(i=1,\ldots ,n\) where \(y_i\) denotes the response variable and \(\varvec{x}_{i}\) is a vector of explanatory variables of different type. More specifically, we assume the model structure

$$\begin{aligned} y_i = \sum _{j=1}^{p}f_j(\varvec{x}_{ij})+ \epsilon _i, \end{aligned}$$

where \(\epsilon _i \sim \mathcal {N}(0, \sigma ^2)\) represents the independent Gaussian error term while the effect of the covariates is additively decomposed into p effects \(f_j(\cdot )\) that may represent linear, nonlinear, clustered (random), or spatial effects (among others) in a generic form. Each of the effects is then expanded in \(d_j\) basis functions as

$$\begin{aligned} f_j(\varvec{x}_{ij}) = \sum _{l=1}^{d_j}\gamma _{jl}B^j_l(\varvec{x}_{ij}) \end{aligned}$$

with effect-specific basis functions \( B^j_l(\varvec{x}_{ij})\) and corresponding basis coefficients \(\gamma _{jl}\). In vector–matrix notation, this implies the model

$$\begin{aligned} \varvec{y}&= \varvec{Z}_1\varvec{\gamma }_1 + \ldots + \varvec{Z}_p\varvec{\gamma }_p + \varvec{\epsilon } = \varvec{Z}\varvec{\gamma } + \varvec{\epsilon } \end{aligned}$$
(1)

where \(\varvec{y}\) and \(\varvec{\epsilon }\) are the vectors of responses and error terms, the design matrices of basis function evaluations are denoted as \(\varvec{Z}_j\), and \(\varvec{\gamma }_j\) are the corresponding vectors of basis coefficients. Stacking all design matrices into the matrix \(\varvec{Z}\) and all basis coefficients into the vector \(\varvec{\gamma }\) yields the final representation as a large linear model.

To regularize the estimation of the basis coefficients, we employ multivariate normal priors

$$\begin{aligned} p(\varvec{\gamma }_{j}|\tau _j^2)&\propto \frac{1}{\big (2\pi \tau _j^2\big )^{\frac{\text {rank}(\varvec{K}_j)}{2}}} \text {exp}\left( -\frac{\varvec{\gamma }_j'\varvec{K}_j\varvec{\gamma }_j}{2\tau _j^2}\right) , \end{aligned}$$
(2)

with zero mean and precision matrix \(\varvec{K}_j / \tau _j^2\). The precision matrix is chosen to reflect desirable regularization properties such as smoothness or shrinkage and may contain a non-trivial null space, rendering Equation (2) a partially improper prior specification. The impact of the prior on the posterior is regulated by the prior variance parameter \(\tau _j^2\). In the remainder of this paper, we will employ weakly informative inverse gamma priors \(\tau _j^2 \sim \text {IG}(a_j, b_j) \), with default values of \( a_j = b_j = 0.1\), but other prior distributions are easily conceivable. Similarly, we assign a weakly informative inverse gamma prior to the error variance, \(\sigma ^2 \sim \text {IG}(a_{\sigma ^2}, b_{\sigma ^2}) \), with the same default values. Analytic forms of the likelihood and the priors are shown in Appendix Sect. 7.3.

Each effect type takes a specific form by choosing the basis functions \(B^j_l(\varvec{x}_{ij})\) and the penalty matrix \( \varvec{K}_j\) (see Fahrmeir et al. 2021, for details):

  • For linear effects, the basis functions are the untransformed covariates, \( \varvec{Z}_j = \varvec{x}_{\cdot j} \), where \( \varvec{x}_{\cdot j} \) is the column vector of observations of covariate j, and a flat prior is obtained by setting \(\varvec{K}_j = \varvec{0}\).

  • In the case of clustered “random” effects, the basis functions represent dummy coding for the grouping variables and the penalty matrix equals the identity matrix, i.e. \( \varvec{K}_j = \varvec{I}_j\).

  • For nonlinear effects of continuous covariates, we use Bayesian P-splines (Lang and Brezger 2004) that are based on B-spline basis functions in combination with a penalty matrix based on a kth-order random walk prior, e.g. a second-order random walk defined as \({\gamma }_{jl} = 2{\gamma }_{j,l-1} - {\gamma }_{j,l-2} + u_{jl},\) with Gaussian errors \( u_{jl} \sim \mathcal {N}(0,\tau _j^2)\) and flat priors for \(\gamma _{j1}\) and \(\gamma _{j2}\). In this way, deviations between subsequent coefficients are penalized, leading to a smoother functional form. The penalty matrix can then be constructed from a difference matrix \( \varvec{D}_j \) such that \( \varvec{K}_j = \varvec{D}_j'\varvec{D}_j \).

  • The concept of Bayesian P-splines can be extended to bivariate tensor product P-splines for fitting spatial effects or interaction surfaces \(f_j(\varvec{x}_{j1}, \varvec{x}_{j2}) \). This is achieved by combining the two univariate spline basis matrices \( \varvec{Z}_{j1} \) and \( \varvec{Z}_{j2} \) in terms of all \( d_{j1} \cdot d_{j2} \) pairwise interactions. The penalty matrix is constructed by combining the two univariate spline penalties \( \varvec{K}_{j1} \) and \( \varvec{K}_{j2} \) to \(\varvec{K}_{j} = \varvec{K}_{j1} \otimes \varvec{I}_{d_{j2}} + \varvec{I}_{d_{j1}}\otimes \varvec{K}_{j2}\) such that smoothness is enforced in both covariate directions, see also Appendix Sect. 7.1 and the sketch below.
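To make these penalty constructions concrete, the following numpy sketch builds a second-order difference penalty \( \varvec{K}_j = \varvec{D}_j'\varvec{D}_j \) and the corresponding Kronecker-sum penalty of a bivariate tensor product spline; the basis dimensions are illustrative.

```python
import numpy as np

d1, d2 = 8, 6                          # illustrative basis dimensions
D1 = np.diff(np.eye(d1), n=2, axis=0)  # second-order difference matrix D_j
K1 = D1.T @ D1                         # univariate P-spline penalty K_j = D_j'D_j
D2 = np.diff(np.eye(d2), n=2, axis=0)
K2 = D2.T @ D2
# Kronecker-sum penalty enforcing smoothness in both covariate directions
K = np.kron(K1, np.eye(d2)) + np.kron(np.eye(d1), K2)
```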

For univariate and bivariate effects as described here, two points should be considered. Firstly, the penalty matrix \( \varvec{K}_j \) is rank deficient and therefore the prior is improper. However, it can be shown that the resulting posterior is still proper (see Appendix Sect. 7.4). Secondly, further restrictions need to be imposed to ensure the identifiability of the model. We impose a centering constraint via the design matrix (see Appendix Sect. 7.2 for more details).

3 Variational inference in additive models

Variational inference (VI), as used in the Bayesian framework, casts the integration problem associated with obtaining the posterior distribution into an optimization problem. During the optimization, VI searches among a set of candidate distributions for the one that approximates the posterior distribution best. If the set of candidate distributions approaches the complexity of the true posterior, VI promises to be computationally faster than MCMC while delivering results of comparable quality. For instance, You et al. (2014) and Wang and Blei (2019) showed consistency of the VI approach in additive models. However, the procedure requires careful choices which determine the quality of the approximation:

  • The variational family \(\mathcal {Q}\), i.e. the set of candidate distributions; a misspecification here directly limits the quality of the estimated posterior.

  • The measure determining the quality of an element of the variational family relative to the exact posterior distribution. The classical divergence measure is the Kullback–Leibler divergence (KL divergence), but other, more general measures as described in Zhang et al. (2018) are possible.

  • The algorithm for finding the best approximating variational distribution, i.e. the best combination of variational parameters \(\varvec{\psi }\), by optimizing the divergence measure. Again, Zhang et al. (2018) discuss different aspects including algorithms and strategies for variance reduction in the context of stochastic VI.

General overviews of variational inference are given in Bishop (2006, Ch. 10), Ormerod and Wand (2010) and Blei et al. (2017). In this paper, we only address the first point and describe possible extensions to the variational family \(\mathcal {Q}\).

In the following, we introduce four different variational families used in this article to approximate the posterior distribution arising in Bayesian additive models. Two of the approaches presented are based on mean field approximations (see Sects. 3.2.1 and 3.2.2) while the remaining families proposed are based on the idea of semi-implicit variational inference (SIVI, Yin and Zhou 2018, see Sects. 3.2.3 and 3.2.4).

We denote the vector of model parameters as \(\varvec{\theta }\) and its posterior density with \(p(\varvec{\theta }| \varvec{y})\). The elements of the variational family \(\mathcal {Q}\), i.e. the variational distributions, are denoted as \(q_{\varvec{\psi }}\) where \(\varvec{\psi }\) is the vector of variational parameters. The density of the variational distribution is denoted as \(q_{\varvec{\psi }}(\varvec{\theta })\).

To measure the deviation between the variational distribution \(q_{\varvec{\psi }}\) and the posterior distribution, the Kullback–Leibler (KL) divergence,

$$\begin{aligned} {\text {KL}}(q_{\varvec{\psi }}(\varvec{\theta })||p(\varvec{\theta }|{\textbf {y}})) = \mathbb {E}_{\varvec{\theta }\sim q_{\varvec{\psi }}}[\log \,q_{\varvec{\psi }}(\varvec{\theta })] - \mathbb {E}_{\varvec{\theta }\sim q_{\varvec{\psi }}}[\log \,p(\varvec{\theta }, {\textbf {y}})] + \log \,p({\textbf {y}}), \end{aligned}$$

is used (Jordan et al. 1999; Ormerod and Wand 2010). The KL divergence decreases with increasing similarity of the two distributions and is zero for two identical distributions. Hence, we want to find \(\varvec{\psi }^\star \) minimizing the KL divergence. Instead of working directly with the KL divergence, the minimization problem is reformulated as an equivalent maximization problem. Precisely, \(\varvec{\psi }^\star \) is determined by maximizing the evidence lower bound (ELBO),

$$\begin{aligned} \mathcal {L} (\varvec{\psi })&= \mathbb {E}_{\varvec{\theta }\sim q_{\varvec{\psi }}}[\log \,p(\varvec{\theta }, {\textbf {y}})] - \mathbb {E}_{\varvec{\theta }\sim q_{\varvec{\psi }}}[\log \,q_{\varvec{\psi }}(\varvec{\theta })], \end{aligned}$$

not containing the intractable marginal likelihood or model evidence \(p({\textbf {y}})\), which does not depend on \(\varvec{\psi }\). The ELBO serves as a lower bound to the model evidence.
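To make the two expectations in the ELBO explicit, the following sketch estimates the ELBO by simple Monte Carlo for a toy conjugate model (\(y_i \sim \mathcal {N}(\theta , 1)\) with prior \(\theta \sim \mathcal {N}(0, 1)\)) and a Gaussian variational distribution; the model and all names are illustrative choices of ours.

```python
import torch

def elbo_estimate(y, m, log_s, n_samples=1000):
    # Gaussian variational distribution q = N(m, s^2)
    q = torch.distributions.Normal(m, log_s.exp())
    theta = q.rsample((n_samples,))  # reparametrized draws allow gradients
    log_lik = torch.distributions.Normal(theta.unsqueeze(-1), 1.0).log_prob(y).sum(-1)
    log_prior = torch.distributions.Normal(0.0, 1.0).log_prob(theta)
    # ELBO = E_q[log p(theta, y)] - E_q[log q(theta)]
    return (log_lik + log_prior - q.log_prob(theta)).mean()

y = torch.randn(20) + 1.5  # synthetic data
m = torch.tensor(0.0, requires_grad=True)
log_s = torch.tensor(0.0, requires_grad=True)
elbo = elbo_estimate(y, m, log_s)  # maximize this, e.g. with torch.optim
```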

3.1 Mean-field and semi-implicit VI

3.1.1 Mean-field VI

Mean-field variational inference (MFVI, Parisi 1988; Saul and Jordan 1998) is based on a strong simplification: the posterior distribution is assumed to be well approximated by independent parameter blocks, allowing the variational density to be expressed as a product of the independent densities of the parameter blocks. The advantage of this simplification lies in eased computation (Wainwright and Jordan 2008, pp. 127–147) and the resulting speed gains. An iterative optimization scheme can be constructed by updating the variational parameters associated with one sub-vector at a time such that each update maximizes the ELBO. For example, the coordinate ascent variational inference algorithm (CAVI, Bishop 2006, Ch. 10) works in this way.

Suppose the vector of model parameters is divided into p sub-vectors such that \(\varvec{\theta }= (\varvec{\theta }_1', \dots , \varvec{\theta }_p')'\). Using the MFVI approach, the variational density can be expressed as \( q_{\varvec{\psi }}(\varvec{\theta }) = \prod _{i=1}^p q_{\varvec{\psi }_i}(\varvec{\theta }_i), \) where \(\varvec{\psi }_i\) denotes the variational parameters associated with the variational distribution of the i-th sub-vector of \(\varvec{\theta }\). The variational density of the i-th sub-vector maximizing the ELBO is

$$\begin{aligned} q^\star (\varvec{\theta }_i) \propto \text {exp}\left\{ \mathbb {E}_{\varvec{\theta }_{-i} \sim q_{\varvec{\psi }_{-i}}}\left[ \log (p(\varvec{y} | \varvec{\theta }) p(\varvec{\theta })) \right] \right\} , \end{aligned}$$
(3)

where \(\varvec{\theta }_{-i}\) denotes the parameter vector \(\varvec{\theta }\) without the i-th sub-vector and \(q_{\varvec{\psi }_{-i}}\) the corresponding variational distribution with its associated variational parameters (Bishop 2006, Ch. 10). When selecting \(q_{\varvec{\psi }_{-i}}\) suitably and exploiting conditional conjugacy, a closed-form solution for \(q^\star \) can be constructed by updating the variational parameter \(\varvec{\psi }_i\), similar to the parameters describing the sampling distribution in a Gibbs update step. CAVI then repeatedly iterates over i to update \(\varvec{\psi }_i\) until convergence of the ELBO.

3.1.2 Semi-implicit VI

SIVI (Yin and Zhou 2018) builds upon the idea of hierarchical variational inference (HVI) proposed by Ranganath et al. (2016) to reintroduce dependencies between the parameter blocks that are assumed independent in MFVI. To illustrate the concept of HVI, suppose we have three parameter blocks \(\varvec{\theta }= (\varvec{\theta }_1, \varvec{\theta }_2, \varvec{\theta }_3)\) and assume a variational hyper-distribution \(q_{\varvec{\phi }}\) for the variational parameters \(\varvec{\psi }_2\) and \(\varvec{\psi }_3\). The variational density of \(\varvec{\theta }\) with the variational parameters \(\varvec{\psi }_1\) and \(\varvec{\phi }\) can then be expressed as

$$\begin{aligned} q_{\varvec{\psi }_1, \varvec{\phi }}(\varvec{\theta })&= q_{\varvec{\psi }_1}(\varvec{\theta }_1) \int q_{\varvec{\psi }_2}(\varvec{\theta }_2) q_{\varvec{\psi }_3}(\varvec{\theta }_3) q_{\varvec{\phi }}(\varvec{\psi }_2, \varvec{\psi }_3) \text {d}\varvec{\psi }_2 \text {d}\varvec{\psi }_3. \end{aligned}$$
(4)

Thus, the dependency between \(\varvec{\theta }_2\) and \(\varvec{\theta }_3\) in the posterior can be restored in the variational distribution via the dependency between \(\varvec{\psi }_2\) and \(\varvec{\psi }_3\) introduced through \(q_{\varvec{\phi }}\). The expansion of the variational family comes at the expense of an increased computational burden. Hence, there is a trade-off between choosing MFVI as the faster optimization method and HVI, which gives better approximations to the posterior in more complex settings but slows down the computation.

SIVI takes the idea of HVI a step further by allowing \(q_{\varvec{\phi }}\) to be an implicit distribution, meaning a distribution whose density cannot be evaluated but from which we can sample. This renders Equation (4) analytically unsolvable, so we cannot access the ELBO directly. Instead, the authors suggest constructing a lower bound to the ELBO. More precisely, the lower bound \(\widetilde{\mathcal {L}}_0\) is constructed as

$$\begin{aligned} \mathcal {L} (\varvec{\psi }_1, \varvec{\phi })&= -\mathbb {E}_{(\varvec{\psi }_2, \varvec{\psi }_3) \sim q_{\varvec{\phi }}}{\text {KL}}(q_{\varvec{\psi }}(\varvec{\theta })||p(\varvec{\theta }|{\textbf {y}})) + \log \,p({\textbf {y}})\\&\ge -{\text {KL}}(\mathbb {E}_{(\varvec{\psi }_2, \varvec{\psi }_3) \sim q_{\varvec{\phi }}}q_{\varvec{\psi }}(\varvec{\theta })||p(\varvec{\theta }|{\textbf {y}})) + \log \,p({\textbf {y}}) = \widetilde{\mathcal {L}}_0 (\varvec{\psi }_1, \varvec{\phi }) \end{aligned}$$

using Jensen’s inequality with the observation that the KL-divergence can be viewed as a convex functional (a proof is provided in Yin and Zhou 2018, Appendix A).

\(\widetilde{\mathcal {L}}_0\) can be used as a target to optimize the variational parameters. In practice, the implicit distribution is implied by the transformation of some noise \(\varvec{\varepsilon } \sim \mathcal {D}\) (e.g. \(\mathcal {D}\) is the k-dimensional standard normal distribution) with a deep neural net such that \((\varvec{\psi }_2', \varvec{\psi }_3')' = (T_{\varvec{\phi }}(\varvec{\varepsilon })_1', T_{\varvec{\phi }}(\varvec{\varepsilon })_2')' = T_{\varvec{\phi }}(\varvec{\varepsilon })\). However, Yin and Zhou (2018) show in Proposition 1 that optimizing \(\widetilde{\mathcal {L}}_0\) without early stopping can lead to a degenerate distribution for \((\varvec{\psi }_2, \varvec{\psi }_3)\), i.e. a distribution with a single point mass. To avoid this, the authors suggest adding a regularizing term to \(\widetilde{\mathcal {L}}_0\), yielding

$$\begin{aligned} \widetilde{\mathcal {L}}(\varvec{\psi }_1, \varvec{\phi })&= \mathbb {E}_{\varvec{\varepsilon } \sim \mathcal {D}}\, \mathbb {E}_{\varvec{\theta }\sim q_{(\varvec{\psi }_1, T_{\varvec{\phi }}(\varvec{\varepsilon }))}}\, \mathbb {E}_{\varvec{\varepsilon }^{(1)}, \dots , \varvec{\varepsilon }^{(K)} \sim \mathcal {D}} \Bigg [ \log p(\varvec{\theta }, {\textbf {y}}) - \log q_{\varvec{\psi }_1}(\varvec{\theta }_1) \nonumber \\&\quad -\log \prod _{i=2}^{3}\bigg ( \frac{1}{K+1} \Big ( q_{T_{\varvec{\phi }}(\varvec{\varepsilon })_i}(\varvec{\theta }_i|T_{\varvec{\phi }}(\varvec{\varepsilon })_i) + \sum _{k=1}^{K} q_{T_{\varvec{\phi }}(\varvec{\varepsilon }^{(k)})_i}(\varvec{\theta }_i|T_{\varvec{\phi }}(\varvec{\varepsilon }^{(k)})_i) \Big ) \bigg ) \Bigg ] \end{aligned}$$
(5)

as the target for optimization, to which we refer from here on as the lower bound ELBO (lbELBO).

Yin and Zhou (2018) show that with increasing K the lbELBO approaches the ELBO reaching equality for \(K \rightarrow \infty \). The expectations in the lbELBO can be estimated via stochastic approximation. Note that the conditional densities \( q_{T_{\varvec{\phi }}(\varvec{\varepsilon })_i}(\varvec{\theta }_i|T_{\varvec{\phi }}(\varvec{\varepsilon })_i) \) can also include non-hierarchical variational parameters, e.g. \( q_{T_{\varvec{\phi }}(\varvec{\varepsilon })_i, \varvec{\psi }_{i,2}}(\varvec{\theta }_i|T_{\varvec{\phi }}(\varvec{\varepsilon })_i) \) with additional fixed parameters \( \varvec{\psi }_{i,2} \). Finally, updates to the variational parameters are based on the respective gradients. The gradients are available via reverse-mode automatic differentiation exploiting the reparametrization trick (Kingma and Welling 2014). In particular, the updates at iteration \(\ell \) are given by

$$\begin{aligned} \varvec{\phi }^{(\ell )}&= \varvec{\phi }^{(\ell - 1)} + \rho _1^{(\ell )} \,\nabla _{\varvec{\phi }}\,\widetilde{\mathcal {L}}(\varvec{\psi }_1, \varvec{\phi }), \\ \varvec{\psi _1}^{(\ell )}&= \varvec{\psi _1}^{(\ell - 1)} + \rho _2^{(\ell )} \,\,\nabla _{\varvec{\psi _1}}\,\widetilde{\mathcal {L}}(\varvec{\psi }_1, \varvec{\phi }), \end{aligned}$$

with exponentially decaying learning rates \(\rho _1^{(\ell )}, \rho _2^{(\ell )}\). Using decaying learning rates improved numerical stability and gave better results overall.

The flexibility of the variational family in SIVI is only limited in two ways: First, the implicit variational mixing distribution must be reparameterizable, that is, it must be possible to sample from it by transforming an auxiliary variable \( \varvec{\epsilon } \) through a differentiable transformation \(T(\cdot )\), e.g. \( \varvec{\psi } = T_{\varvec{\phi }}( \varvec{\epsilon }) \) with \( \varvec{\epsilon } \sim \mathcal {N}(\varvec{0},\varvec{I})\). Second, the conditional variational distribution of the coefficients must be analytic and reparameterizable or, as demonstrated in Yin and Zhou (2018), the ELBO must be analytic.
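As an illustration of such a reparameterizable implicit distribution, the following pytorch sketch pushes standard normal noise through a small neural network \(T_{\varvec{\phi }}\); the architecture (two hidden ReLU layers) and all dimensions are our assumptions for illustration, not the exact setup used later.

```python
import torch
import torch.nn as nn

class TPhi(nn.Module):
    """Implicit mixing distribution: psi = T_phi(eps) with eps ~ N(0, I)."""
    def __init__(self, noise_dim, out_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, n_draws):
        eps = torch.randn(n_draws, self.net[0].in_features)  # eps ~ N(0, I)
        return self.net(eps)                                  # draws of psi

t_phi = TPhi(noise_dim=10, out_dim=30)  # e.g. 30 stacked variational means
optimizer = torch.optim.Adam(t_phi.parameters(), lr=1e-3)
psi_draws = t_phi(n_draws=5)            # 5 draws from the implicit distribution
```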

3.2 Mean-field and semi-implicit VI for additive models

In this section, we discuss the concrete implementations of two MFVI approaches and two SIVI approaches for the additive model.

3.2.1 Mean-field VI with block-diagonal covariance matrix

In the additive model, the mean-field factorization can be blocked as \(\varvec{\theta }= (\varvec{\gamma }_1',\ldots ,\varvec{\gamma }_p', \tau _1^2, \dots , \tau _p^2, \sigma ^2)'\) with the variational density given by the factors

$$\begin{aligned} q_{\varvec{\psi }}(\varvec{\theta }) =q_{\varvec{\psi }_{\sigma ^2}}(\sigma ^2) \prod _{j=1}^{p} q_{\varvec{\psi }_{\varvec{\gamma }_j}}(\varvec{\gamma }_{j}) \, q_{\varvec{\psi }_{\tau ^2_j}}(\tau ^2_j), \quad q_{\varvec{\psi }} \in \mathcal {Q}_{\text {MFb}}, \end{aligned}$$

(Waldmann and Kneib 2015). Using Equation (3) and exploiting conditional conjugacy results in each \(q_{\varvec{\psi }_{\varvec{\gamma }_j}}\) being multivariate Gaussian and the remaining distributions being inverse gamma, each parametrized by the parameter vector in its index.

This formulation allows the construction of iterative updates to the variational parameters as follows: For \(\varvec{\psi }_{\varvec{\gamma }_j}\) as re-parametrization of the mean vector and covariance matrix, i.e. \(\varvec{\psi }_{\varvec{\gamma }_j} = (\varvec{\mu }_j, \varvec{\Sigma }_j)\), in the variational distribution of \(\varvec{\gamma }_j\), the updates are

$$\begin{aligned} \varvec{\Sigma }_j&= \left( \mathbb {E}_{q}\!\left[ \tfrac{1}{\sigma ^2}\right] \varvec{Z}_j'\varvec{Z}_j + \mathbb {E}_{q}\!\left[ \tfrac{1}{\tau _j^2}\right] \varvec{K}_j \right) ^{-1}, \\ \varvec{\mu }_j&= \mathbb {E}_{q}\!\left[ \tfrac{1}{\sigma ^2}\right] \varvec{\Sigma }_j \varvec{Z}_j'\Big ( \varvec{y} - \sum _{k \ne j} \varvec{Z}_k \varvec{\mu }_k \Big ). \end{aligned}$$

The variational distribution \( q^*(\sigma ^2)\) of the error variance is \(\text {IG}(\nu _{a_{\sigma ^2}}, \nu _{b_{\sigma ^2}})\) and the updates for the variational parameters are

$$\begin{aligned} \nu _{a_{\sigma ^2}}&= a_{\sigma ^2} + \frac{n}{2}, \end{aligned}$$
(6)
$$\begin{aligned} \nu _{b_{\sigma ^2}}&= b_{\sigma ^2} + \frac{1}{2} \left( (\varvec{y} - \varvec{Z}\varvec{\mu })'(\varvec{y} - \varvec{Z}\varvec{\mu }) + \text {tr}\left( \varvec{Z}'\varvec{Z}\varvec{\Sigma }\right) \right) , \end{aligned}$$
(7)

where \( \varvec{Z} = (\varvec{Z}_1, \ldots , \varvec{Z}_p) \) and \(\varvec{\mu } =(\varvec{\mu }_1', \ldots , \varvec{\mu }_p')'\) are the stacked design matrices and mean vectors, respectively. For the covariance matrix \( \varvec{\Sigma } \), we can collect the component-wise covariance matrices into one block-diagonal covariance matrix, i.e. \(\varvec{\Sigma } = \text {blockdiag}(\varvec{\Sigma }_1, \dots , \varvec{\Sigma }_p) \). Therefore, we call this approach, MFVI with a block-diagonal covariance matrix, MFVI (block).

The variational distributions \( q^*(\tau _j^2) \) of the smoothing parameters are \(\text {IG}(\nu _{a_j}, \nu _{b_j}), \;\forall \; j=1, \dots , p\) and the updates for the variational parameters are

$$\begin{aligned} \nu _{a_j}&= a_j + \frac{\text {rank}(\varvec{K}_j)}{2} , \end{aligned}$$
(8)
$$\begin{aligned} \nu _{b_j}&= b_j + \frac{1}{2} \left( \text {tr}\left( \varvec{K}_j\varvec{\Sigma }_j\right) + \varvec{\mu }_j'\varvec{K}_j\varvec{\mu }_j\right) . \end{aligned}$$
(9)

The full derivations for the variational distribution of the coefficients, the error variance and the smoothing parameters are shown in Appendix Sect. 7.4.
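The updates above translate directly into code. The following numpy sketch iterates the CAVI updates of MFVI (block) for given design matrices Z_j and penalty matrices K_j; the initialization, the use of the default prior parameters a = b = 0.1 for all components, and a fixed iteration count in place of monitoring the ELBO are simplifications for illustration.

```python
import numpy as np

def cavi_block(y, Z_list, K_list, a=0.1, b=0.1, n_iter=100):
    n, p = len(y), len(Z_list)
    mu = [np.zeros(Z.shape[1]) for Z in Z_list]
    Sigma = [np.eye(Z.shape[1]) for Z in Z_list]
    e_inv_s2, e_inv_t2 = 1.0, np.ones(p)  # E[1/sigma^2], E[1/tau_j^2]
    for _ in range(n_iter):
        for j, (Z, K) in enumerate(zip(Z_list, K_list)):
            # update q(gamma_j): multivariate Gaussian
            Sigma[j] = np.linalg.inv(e_inv_s2 * Z.T @ Z + e_inv_t2[j] * K)
            resid = y - sum(Zk @ mu[k] for k, Zk in enumerate(Z_list) if k != j)
            mu[j] = e_inv_s2 * Sigma[j] @ Z.T @ resid
            # update q(tau_j^2): inverse gamma, Equations (8) and (9)
            nu_a = a + np.linalg.matrix_rank(K) / 2
            nu_b = b + 0.5 * (np.trace(K @ Sigma[j]) + mu[j] @ K @ mu[j])
            e_inv_t2[j] = nu_a / nu_b  # E[1/tau_j^2] under IG(nu_a, nu_b)
        # update q(sigma^2): inverse gamma, Equations (6) and (7)
        fit = sum(Z @ m for Z, m in zip(Z_list, mu))
        quad = sum(np.trace(Z.T @ Z @ S) for Z, S in zip(Z_list, Sigma))
        e_inv_s2 = (a + n / 2) / (b + 0.5 * ((y - fit) @ (y - fit) + quad))
    return mu, Sigma
```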

3.2.2 Mean-field VI with full covariance matrix

In the special case of a multivariate Gaussian distribution for the coefficients, we can use a single multivariate Gaussian distribution for all coefficients such that the variational density factorizes as

$$\begin{aligned} q_{\varvec{\psi }}(\varvec{\theta }) =q_{\varvec{\psi }_{\sigma ^2}}(\sigma ^2)\, q_{\varvec{\psi }_{\varvec{\gamma }}}(\varvec{\gamma }) \prod _{j=1}^{p} q_{\varvec{\psi }_{\tau ^2_j}}(\tau ^2_j), \quad q_{\varvec{\psi }} \in \mathcal {Q}_{\text {MFf}}, \end{aligned}$$

with multivariate Gaussian distribution \(q_{\varvec{\psi }_{\varvec{\gamma }}}\). The iterative updates to the variational parameters are

$$\begin{aligned} \varvec{\Sigma }&= \left( \mathbb {E}_{q}\!\left[ \tfrac{1}{\sigma ^2}\right] \varvec{Z}'\varvec{Z} + \text {blockdiag}\left( \mathbb {E}_{q}\!\left[ \tfrac{1}{\tau _1^2}\right] \varvec{K}_1, \dots , \mathbb {E}_{q}\!\left[ \tfrac{1}{\tau _p^2}\right] \varvec{K}_p \right) \right) ^{-1}, \\ \varvec{\mu }&= \mathbb {E}_{q}\!\left[ \tfrac{1}{\sigma ^2}\right] \varvec{\Sigma }\, \varvec{Z}'\varvec{y}. \end{aligned}$$

As the covariance matrix \( \varvec{\Sigma } \) is full and unstructured, we call this approach, MFVI with a full covariance matrix, MFVI (full). The updates for the error variance and the smoothing parameters are the same as in MFVI (block).

In MFVI (full), the mean-field assumption still plays a role but is not very restrictive. First, the assumption of independence between the error variance and the coefficients is only a mild one. For instance, in the case of a diminishing penalty term close to zero, we can appeal to the properties of ordinary least squares, where all columns of the design matrix are orthogonal to the residuals. For larger influences of the penalty term, however, this assumption is violated. Second, the assumption that the smoothing parameters of the components are independent of each other and, conditionally on the coefficients, of the error variance imposes only a mild restriction, as it only concerns the hyper-parameters. However, MFVI (full) is computationally very demanding for large variational covariance matrices \( \varvec{\Sigma }\), because inverting the \(m \times m\) matrix \(\varvec{\Sigma }\) requires \(O(m^3)\) operations. Furthermore, MFVI (full) as discussed here for conditionally conjugate models is not very flexible, as it is limited to a multivariate Gaussian variational distribution for the coefficients of all additive components. For smaller data sets, however, a multivariate Gaussian variational distribution may not capture heavier tails.
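For contrast with the blocked updates, a minimal sketch of the MFVI (full) coefficient update: scipy's block_diag assembles the combined penalty, and the inversion of the full m × m matrix is the O(m^3) bottleneck discussed above.

```python
import numpy as np
from scipy.linalg import block_diag

def update_gamma_full(y, Z, K_list, e_inv_s2, e_inv_t2):
    # combined penalty: E[1/tau_j^2] K_j stacked block-diagonally
    P = block_diag(*[e * K for e, K in zip(e_inv_t2, K_list)])
    Sigma = np.linalg.inv(e_inv_s2 * Z.T @ Z + P)  # full, unstructured m x m
    mu = e_inv_s2 * Sigma @ Z.T @ y
    return mu, Sigma
```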

3.2.3 Semi-implicit VI

Using SIVI has the advantage of retaining a blocked covariance matrix while increasing the flexibility of the variational distribution for the coefficients and thus restoring coefficient dependencies. Additionally, more complex posterior distributions can be captured. The variational density is

$$\begin{aligned} q_{\varvec{\phi }, \varvec{\nu }}(\varvec{\theta }) = \,&q_{\varvec{\psi }_{\sigma ^2}}(\sigma ^2) \prod _{j=1}^{p} q_{\varvec{\psi }_{\tau ^2_j}}(\tau ^2_j) \nonumber \\&\int \left( \prod _{j=1}^{p} q_{\varvec{\psi }_{\varvec{\gamma }_j}}(\varvec{\gamma }_{j}|\varvec{\psi }_{\varvec{\gamma }_j})\right) q_{\varvec{\phi }}(\varvec{\psi }_{\varvec{\gamma }}) \text {d}\varvec{\psi }_{\varvec{\gamma }}, \quad q_{\varvec{\phi }, \varvec{\nu }} \in \mathcal {Q}_{\text {HVM}}, \end{aligned}$$
(10)

with \( \varvec{\nu } = (\varvec{\psi }_{\sigma ^2}, \varvec{\psi }_{\tau ^2_1}, \dots ,\varvec{\psi }_{\tau ^2_p}) \). In this way, the variational mean-field parameters for the coefficients, \(\varvec{\psi }_{\varvec{\gamma }}\), are marginalized out. The p coefficient blocks are designed to be independent conditional on \( \varvec{\psi }_{\varvec{\gamma }} \); marginally, however, dependencies between the blocks can be captured.

In line with the variational distributions of MFVI, we choose inverse gamma distributions, \(\sigma ^2 \sim \text {IG}(\nu _{a_{\sigma ^2}}, \nu _{b_{\sigma ^2}})\) with \( \varvec{\psi }_{\sigma ^2} = (\nu _{a_{\sigma ^2}}, \nu _{b_{\sigma ^2}}) \) and \( \tau ^2_j \sim \text {IG}(\nu _{a_j}, \nu _{b_j})\) with \( \varvec{\psi }_{\tau ^2_j} = (\nu _{a_j}, \nu _{b_j}) \) for each j. For the conditional variational distribution, we use a multivariate Gaussian distribution, i.e. \( \varvec{\gamma }_j|\varvec{\psi }_{\varvec{\gamma }_j} \sim \mathcal {N}(\varvec{\mu }_j, \varvec{\Sigma }_{\varvec{\xi }_j}) \) with \( \varvec{\psi }_{\varvec{\gamma }_j} = (\varvec{\mu }_j, \varvec{\Sigma }_{\varvec{\xi }_j}) \). As the optimization was numerically unstable when conditioning on both \( \varvec{\mu }_j \) and \( \varvec{\Sigma }_{\varvec{\xi }_j} \) (conditioning on the variational parameters of the smoothing parameters and the error variance also led to unreliable results), we apply the hierarchical expansion only to the variational parameters \( \varvec{\mu } = (\varvec{\mu }_1',\dots , \varvec{\mu }_p')' \), i.e. \( \varvec{\gamma }_j|\varvec{\mu }_j \sim \mathcal {N}(\varvec{\mu }_j, \varvec{\Sigma }_{\varvec{\xi }_j})\).

The variational parameters of the coefficients are generated as \(\varvec{\mu } = T_{\varvec{\phi }}(\varvec{\epsilon })\) with \( \varvec{\epsilon } \sim \mathcal {N}(\varvec{0}, \varvec{I}) \), where \( T_{\varvec{\phi }}\) transforms the noise \( \varvec{\epsilon } \) through a deep neural network with weight and bias parameters \( \varvec{\phi } \).

The covariance matrix \( \varvec{\Sigma }_{\varvec{\xi }_j} \) is parameterized by \( \varvec{\xi }_j \), the vectorized upper triangular Cholesky factor used to build the covariance matrix, i.e. \( \varvec{\Sigma }_{\varvec{\xi }_j} = g_U(\varvec{\xi }_j)'g_U(\varvec{\xi }_j) \), where \( g_U \) forms an upper triangular Cholesky factor from its argument vector.
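A small pytorch sketch of this parametrization is given below; exponentiating the diagonal of the factor is our illustrative choice to keep it positive under unconstrained gradient updates.

```python
import torch

def g_U(xi, d):
    # fill the upper triangle of a d x d matrix with the entries of xi
    rows, cols = torch.triu_indices(d, d)
    U = torch.zeros(d, d, dtype=xi.dtype).index_put((rows, cols), xi)
    # replace the diagonal by its exponential to keep it positive
    return U - torch.diag(U.diagonal()) + torch.diag(U.diagonal().exp())

xi = torch.randn(6, requires_grad=True)  # d(d+1)/2 free entries for d = 3
U = g_U(xi, d=3)
Sigma = U.T @ U                          # Sigma_xi = g_U(xi)' g_U(xi)
```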

For more details about the structure of SIVI, we provide an illustration in Fig. 6 in Appendix Sect. 9.1 and a pseudo-algorithm in Algorithm 2 in Appendix Sect. 9.

The updates are determined using a gradient-based approach. For the variational hyper-parameters we use the Adam optimizer (Kingma and Ba 2015), and for the parameters \( \varvec{\xi } \) and \( \varvec{\nu } \) we use a stochastic gradient descent optimizer. To estimate the gradients, we rely on the lbELBO for additive models given by

$$\begin{aligned} \widetilde{\mathcal {L}}(\varvec{\phi }, \varvec{\xi }, \varvec{\nu })&= \mathbb {E}_{\varvec{\epsilon } \sim \mathcal {D}}\, \mathbb {E}_{\varvec{\epsilon }^{(1)}, \dots , \varvec{\epsilon }^{(K)} \sim \mathcal {D}}\, \frac{1}{S}\sum _{s=1}^{S} \Bigg [ \mathbb {E}_{(\sigma ^2, \tau ^2_1, \dots , \tau ^2_p) \sim q_{\varvec{\nu }}}\big [ \log \, p(\varvec{\gamma }_{.,s}, \tau ^2_1, \dots , \tau ^2_p, \sigma ^2, \varvec{y}) - \log \, q_{\varvec{\nu }}(\sigma ^2, \tau ^2_1, \dots , \tau ^2_p) \big ] \nonumber \\&\quad - \sum _{j=1}^{p} \log \frac{1}{K+1} \bigg ( q_{T_{\varvec{\phi }}(\varvec{\epsilon })_j, \varvec{\xi }_j}(\varvec{\gamma }_{j,s}|T_{\varvec{\phi }}(\varvec{\epsilon })_j) + \sum _{k=1}^{K} q_{T_{\varvec{\phi }}(\varvec{\epsilon }^{(k)})_j, \varvec{\xi }_j}(\varvec{\gamma }_{j,s}|T_{\varvec{\phi }}(\varvec{\epsilon }^{(k)})_j) \bigg ) \Bigg ] \end{aligned}$$
(11)

with \( \varvec{\gamma }_{.,s} \) the s-th sample of the stacked coefficient vector. All parts of Formula (11) that involve expectations with respect to \( \sigma ^2 \) or the \( \tau ^2_j \) are available in analytic form. Hence, we can use the same analytic derivations as for MFVI. The expectation with respect to the parameters \( \varvec{\gamma }_1, \dots , \varvec{\gamma }_p \) is not tractable and needs to be approximated. We use stochastic approximation by taking S samples \( \varvec{\gamma }_{j,s}\) and averaging over them.
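For numerical stability, the logarithm of the (K+1)-component mixture density appearing in the lbELBO is best evaluated with logsumexp, as in the following pytorch sketch; shapes and names are illustrative.

```python
import math
import torch
from torch.distributions import MultivariateNormal

def log_mixture_density(gamma, mus, Sigma):
    # gamma: (d,) coefficient draw; mus: (K + 1, d) means mu, mu^(1), ..., mu^(K);
    # Sigma: (d, d) conditional covariance shared by all components
    comp = MultivariateNormal(mus, covariance_matrix=Sigma)
    log_probs = comp.log_prob(gamma)  # (K + 1,) component log densities
    return torch.logsumexp(log_probs, dim=0) - math.log(mus.shape[0])

d, K = 4, 100
print(log_mixture_density(torch.randn(d), torch.randn(K + 1, d), torch.eye(d)))
```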

The choices of the variational distributions match those of MFVI for reasons of comparability, and they enable an analytic solution for MFVI. However, various MFVI algorithms have been developed to go beyond conditionally conjugate models. In the case of SIVI, an arbitrary variational distribution for the error variance and the smoothing parameter(s) can be used. Additionally, the conditional variational distribution is not restricted to be multivariate Gaussian; other symmetric as well as asymmetric distributions can be considered.

3.2.4 Semi-implicit mean field VI

We propose an additional method which we call semi-implicit mean field variational inference (SIMFVI) that can be viewed as a hybrid between SIVI and MFVI. SIMFVI uses the same variational family as SIVI, but the variational parameters are updated differently. Wherever possible, we use the analytical updates as in Equation (3) with noisy estimates of the expected values. For the additive model, this algorithm updates the variational hyper-parameters with the gradient-based method as in SIVI, while \( \varvec{\Sigma }_{j} \) and \(\varvec{\nu } \) are updated similar to the analytic MFVI (block) updates. However, we need to use a stochastic approximation of the expectation with respect to \(\varvec{\gamma }_{j}\). Hence, the updates for the scale parameter of the error variance are

$$\begin{aligned} \nu _{b_{\sigma ^2}}&= b_{\sigma ^2} + \frac{1}{2} \mathbb {E}_{\varvec{\gamma } \sim q_{\varvec{\psi }_{\varvec{\gamma }}}}\big [ (\varvec{y} - \varvec{Z}\varvec{\gamma })'(\varvec{y} - \varvec{Z}\varvec{\gamma }) \big ] \\&\approx b_{\sigma ^2} + \frac{1}{2} \bigg ( \varvec{y}'\varvec{y} - \frac{2}{S} \sum _{s=1}^{S} \varvec{\gamma }_{.,s}'\varvec{Z}'\varvec{y} + \frac{1}{S} \sum _{s=1}^{S} \varvec{\gamma }_{.,s}'\varvec{Z}'\varvec{Z}\varvec{\gamma }_{.,s} \bigg ). \end{aligned}$$

The updates for the scale parameter of the smoothing parameter for component j are,

$$\begin{aligned} \nu _{b_j}&= b_j + \frac{1}{2} \mathbb {E}_{\varvec{\gamma } \sim q_{\varvec{\psi }_{\varvec{\gamma }}}}\big [ \varvec{\gamma }_j'\varvec{K}_j \varvec{\gamma }_j\big ] \approx b_j + \frac{1}{2S} \sum _{s=1}^{S}{\varvec{\gamma }_{j,s}'\varvec{K}_j \varvec{\gamma }_{j,s}}. \end{aligned}$$
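Both updates are straightforward to implement. The following numpy sketch replaces the expectations by averages over S draws from the current variational distribution; the residual form used for the error variance is algebraically equal to the expanded form displayed above.

```python
import numpy as np

def nu_b_sigma2(y, Z, gamma_draws, b_sigma2=0.1):
    # E[(y - Z gamma)'(y - Z gamma)] estimated by the average over S draws
    resid_sq = [(y - Z @ g) @ (y - Z @ g) for g in gamma_draws]
    return b_sigma2 + 0.5 * np.mean(resid_sq)

def nu_b_tau2(K_j, gamma_j_draws, b_j=0.1):
    # E[gamma_j' K_j gamma_j] estimated by the average over S draws
    return b_j + 0.5 * np.mean([g @ K_j @ g for g in gamma_j_draws])
```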

We provide a full implementation of the described approaches in the Python programming language. The implementation uses the Python libraries numpy (Harris et al. 2020) and pytorch (Paszke et al. 2019) for random number generation and automatic differentiation.

4 Simulation

In the simulation study, we assess the uncertainty estimates of the introduced VI methods based on empirical coverage percentages over repeated simulations. The coverage percentage is the relative frequency with which the credible interval (CI) contains the true value. For a 95% CI, we therefore expect a coverage percentage close to the nominal level of 95%. In Bayesian inference, the CI is based either on the highest density interval (Turkkan and Pham-Gia 1993) or on the quantiles of the distribution. In this paper, the CIs are based on the quantiles.

In the case of MFVI (block), we suppose the coverage percentage will be below the nominal level due to the strong assumption of independence between coefficient blocks. We also presume that MFVI (full) accurately captures parameter uncertainty, but it comes with the limitation of determining a fully unstructured covariance matrix for all parameters simultaneously, which requires handling large matrices. On the other hand, SIVI and SIMFVI combine the advantage of using a blocked structure with a hierarchical construction to restore parameter dependencies. Therefore, we hypothesize that the coverage percentages of SIVI and SIMFVI are close to the nominal level as well. Additionally, we compare the aforementioned methods with the Gibbs sampler, which we use as a reference.

We assess the coverage of both point-wise and simultaneous CIs. For estimating simultaneous CIs, we develop an efficient algorithm (see Algorithm 2 in Appendix 8) based on a fully Bayesian quantile-based approach (Krivobokova et al. 2010).
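For orientation, the following numpy sketch shows a common quantile-based construction of simultaneous bands from function draws, scaling the pointwise standard deviations by a quantile of the maximal standardized deviation; it illustrates the idea but may differ in details from Algorithm 2 in the Appendix.

```python
import numpy as np

def simultaneous_band(f_draws, alpha=0.05):
    # f_draws: (S, G) array of S function draws evaluated at G grid points
    m, s = f_draws.mean(axis=0), f_draws.std(axis=0)
    max_dev = np.abs((f_draws - m) / s).max(axis=1)  # per-draw sup statistic
    c = np.quantile(max_dev, 1 - alpha)              # simultaneous multiplier
    return m - c * s, m + c * s

lower, upper = simultaneous_band(np.random.randn(1000, 50))
```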

The data generating process (DGP) is based on two covariates that affect the response in a nonlinear way. Hence, for the model, we can use P-splines and can block the covariance matrix accordingly. The DGP has the form

  1. (DGP)

    \( y_i = f_1(x_{i1}) + f_2(x_{i2}) + \epsilon _i \).

The errors are independently generated from a Gaussian distribution with variance 0.5, i.e. \(\epsilon _i\sim \mathcal {N}(0,0.5)\). The values for the covariates \(x_{i1}\) and \(x_{i2}\) are generated in two steps. In the first step, we generate values from a bivariate normal distribution, \({(z_{i1}, z_{i2})' = \varvec{z}_i \sim \mathcal {N}(\varvec{0}, \varvec{A} )}\), with \( \varvec{A} \) having ones on the diagonal and the value \({\rho \in (-1,1)}\) on the off-diagonals to control the correlation between the variables. We consider correlations of varying intensity, namely no correlation (\(\rho = 0\)), medium correlation (\(\rho = 0.45\)), and strong correlation (\(\rho = 0.9\)). In the second step, we apply a probability integral transform to the variables, such that \(x_{i1} = 5\cdot F(z_{i1}) \) and \(x_{i2} = 7\cdot F(z_{i2}) - 1\), with \( F(\cdot ) \) the univariate standard normal cumulative distribution function.

The two functional forms of the nonlinear effects are,

$$\begin{aligned} f_1(x_{i1})&= \sin \left( \frac{\pi }{4} \, x_{i1} - 1 \right) + 2 \exp \left( -(x_{i1} - 1)^2\right) , \end{aligned}$$
(12)
$$\begin{aligned} f_2(x_{i2})&= \sin \left( \frac{3\,\pi }{16} \, x_{i2} - \frac{1}{2} \right) + 2 \exp \left( -\frac{3}{2}(x_{i2} - \frac{1}{2})^2 \right) . \end{aligned}$$
(13)

The functional forms \( f_1 \) and \( f_2 \) have a similar shape, which additionally increases the difficulty of distinguishing between the two effects. Both functions have one sharp peak, around 1.2 and 0.6, respectively (see Fig. 7 in Appendix Sect. 9.2).
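For reproducibility, a numpy sketch of this DGP, combining the correlated covariates, the probability integral transform, and Equations (12) and (13):

```python
import numpy as np
from scipy.stats import norm

def simulate_dgp(n, rho, seed=0):
    rng = np.random.default_rng(seed)
    A = np.array([[1.0, rho], [rho, 1.0]])           # correlation matrix
    z = rng.multivariate_normal(np.zeros(2), A, size=n)
    x1, x2 = 5 * norm.cdf(z[:, 0]), 7 * norm.cdf(z[:, 1]) - 1
    f1 = np.sin(np.pi / 4 * x1 - 1) + 2 * np.exp(-(x1 - 1) ** 2)
    f2 = np.sin(3 * np.pi / 16 * x2 - 0.5) + 2 * np.exp(-1.5 * (x2 - 0.5) ** 2)
    y = f1 + f2 + rng.normal(0.0, np.sqrt(0.5), size=n)  # error variance 0.5
    return y, x1, x2

y, x1, x2 = simulate_dgp(n=250, rho=0.9)
```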

Moreover, we vary the sample size in the simulation study. For each of the three correlation scenarios, we run the simulation using 50, 250, or 500 observations, resulting in nine different scenarios in total. For each scenario, we use 1000 replications.

We argue that the simulation results for P-splines are transferable to other effect types involving clustered or spatial effects, as the coefficient structure in the model remains the same; only the basis functions and the number of coefficients change. Hence, it appears sufficient to limit the extent of the simulation study to two nonlinear effects modeled with P-splines.

Simulation results, depicted in Fig. 1, show the coverage percentage for each spline and method for the three scenarios with the most pronounced differences across the methods. These are, not surprisingly, the scenarios with high or medium correlation and rather small sample sizes (the results of the other scenarios are shown in Fig. 8 in Appendix Sect. 9.2).

Fig. 1

Coverage percentages among different methods for three selected scenarios. The blue dots represent local and the yellow triangles simultaneous CI coverages

In the scenarios with a high correlation between the covariates (middle and right plot), MFVI (block) has very low coverage for both splines. The simultaneous CI of the estimated function for \(f_2\) in the scenario with 50 observations has a coverage below 70%, significantly below the nominal level of 95%. SIVI, SIMFVI, and also MFVI (full) show coverages close to the nominal level. For the scenario with medium correlation and 50 observations, however, the coverage percentage of the simultaneous CI for \(f_2\) is well below the nominal level for MFVI (block) as well as MFVI (full). This might be due to the fact that, for small sample sizes, the Gaussian distribution assumption on the coefficients is too restrictive. The more flexible methods SIVI and SIMFVI improve the coverage but are also slightly below the nominal level.

For other criteria, such as the mean squared error (MSE) for each spline and the overall MSE of the fitted values, no significant differences across the methods are visible. There is only a slight tendency for the MSEs of SIVI and SIMFVI to be larger on average. For instance, in the scenario with 50 observations and strong correlation, the smallest overall MSE is that of MFVI (block) with 0.116 and the largest that of SIVI with 0.123 (see Table 2 in Appendix Sect. 9.2 for further details on this scenario).

The simulations show very accurate results for the Gibbs sampler across all criteria and scenarios. However, the coverage percentages of the local and, in particular, the simultaneous CIs were above the nominal level by about 0.4 to 4.2 percentage points in all scenarios, indicating that the uncertainty is slightly overestimated. In particular, the simultaneous CI bands of the estimated function for \(f_1\) appear to be too wide. Nevertheless, we use the Gibbs sampler as the reference when comparing the methods in the application, as the MCMC approach is expected to give asymptotically exact results.

Assuming the samples of the MCMC approach come from the desired posterior distribution, we also evaluate the KL divergence for each VI method based on these samples. This gives a more holistic evaluation over the complete distribution. We approximately evaluate

$$\begin{aligned} -{\text {KL}}\left( p(\varvec{\theta |y}) || q(\varvec{\theta }) \right) \approx \frac{1}{S}\sum _{s=1}^S \log \, q(\varvec{\theta }_s) - \frac{1}{S}\sum _{s=1}^S \log \, p(\varvec{\theta }_s|\varvec{y}) \end{aligned}$$

with Gibbs samples \(\varvec{\theta }_1,\dots , \varvec{\theta }_S \sim p(\varvec{\theta |y})\). Since we compare the different VI methods based on the same samples, we only need to evaluate the term \(\frac{1}{S}\sum _{s=1}^S \log \, q(\varvec{\theta }_s)\), that is, the average logarithmized density given the Gibbs samples (ALDG). Higher values indicate better approximations to the true posterior. For SIVI and SIMFVI, we evaluate the density of the coefficients by averaging out the mixing distribution,

$$\begin{aligned} \log \, q(\varvec{\theta }_s) \approx \sum _{j=1}^{p} \left[ \log \frac{1}{\text {M}}\sum _{m=1}^{\text {M}} q_{\varvec{\mu }_{jm}, \varvec{\Sigma }_j}(\varvec{\gamma }_{js} | \varvec{\mu }_{jm}) + \log \, q_{a_j, b_j}(\tau ^2_{js})\right] + \log \, q_{a, b}(\sigma ^2_s), \end{aligned}$$

with M samples from the neural network. For MFVI, the full factorization applies.
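As an illustration for the mean-field case, the following scipy sketch computes the ALDG by averaging the log variational density over the Gibbs samples; the container q holding the fitted variational parameters is an assumed convention.

```python
import numpy as np
from scipy.stats import invgamma, multivariate_normal

def aldg_mfvi(gamma_samples, tau2_samples, sigma2_samples, q):
    # log q(gamma) under the fitted Gaussian with mean q["mu"], cov q["Sigma"]
    log_q = multivariate_normal.logpdf(gamma_samples, q["mu"], q["Sigma"])
    for j, (a, b) in enumerate(q["tau2"]):           # IG factors for tau_j^2
        log_q += invgamma.logpdf(tau2_samples[:, j], a, scale=b)
    a, b = q["sigma2"]                               # IG factor for sigma^2
    log_q += invgamma.logpdf(sigma2_samples, a, scale=b)
    return log_q.mean()                              # higher = better fit
```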

The results confirm our previous findings. In Fig. 2, we show the ALDG of the coefficients for each VI method (figures of the total ALDG and of the ALDG for the different model parameters in all scenarios are given in Figs. 9, 10, 11, 12, 13, 14, 15, 16 and 17 in Appendix Sect. 9). MFVI (block) does not accurately capture the complete distribution, whereas SIVI and SIMFVI show significant improvements. However, evaluating the ALDG for high correlation between the covariates reveals that MFVI (full) performs slightly better. Hence, the hierarchical approach with a flexible mixing distribution restores parameter dependencies to a large extent but may fail to restore all of them.

Fig. 2

The coefficient ALDG across all VI methods for three selected scenarios: 50 observations and medium correlation (left), 50 observations and high correlation (middle) and 250 observations and high correlation (right). The boxplots are based on 1000 simulations

The extent to which SIVI and SIMFVI restore parameter dependencies is highly sensitive to the specification of the neural network. Important considerations are the network structure and the activation function. We see a deterioration in performance if the network has more than 3 hidden layers and if the activation function is the sigmoid instead of ReLU or tanh. The specification of the input dimension, however, does not affect the results significantly (one example of the sensitivity is shown in Fig. 18 in Appendix Sect. 9.2). Most important, however, is the choice of the learning rates for the parameter updates. In the simulation study, it appears that higher learning rates (up to 0.1) generally improve the results, but the algorithm becomes less numerically stable. We give further details on the model setup and the choice of hyper-parameters in Appendix Sect. 7.7.

5 Application to tree height models of Douglas fir

Douglas fir is a conifer species non-native to Germany. It is expected to be resilient to drought events and higher temperatures and, thus, with changing climatic conditions, may serve as an important addition to the tree species portfolio of German climate-smart forestry. Modeling tree heights of Douglas fir is of high value for economic and climate considerations concerning, e.g. returns on investment and carbon storage potentials.

We use data from the national forestry inventory (NFI) of Germany and a climate data set provided by the Nordwestdeutsche Forstliche Versuchsanstalt.

We model heights of Douglas fir in Germany using two types of covariates, namely tree- and climate-specific covariates. Some of these covariates are strongly correlated. Accordingly, a method should be used that can reflect the increased uncertainty of the estimates. We therefore use the novel SIVI and SIMFVI methods and compare the results to standard MFVI. Additionally, we use the Gibbs sampler as the benchmark method.

5.1 Statistical additive model

For modeling the mean tree height of Douglas fir, we employ an additive model (similar to Pya and Schmidt 2016). With the combined tree and climate data, we fit the following model for the observed tree height,

$$\begin{aligned} h_{i} \,=\;&\beta _0 + \beta _1 dbh_{i} + \beta _2 dbh^2_{i} + f_1(age_{i}) \nonumber \\&+ f_2(prec_{i})+ f_3(t_{i}) + f_4(alt_{i}) + f_{\text {geo}}(long_{i}, lat_{i}) + \epsilon _{i}, \end{aligned}$$
(14)

where we assume a Gaussian distributed error \(\epsilon _i\).

The tree-specific data are the tree height in meters (h), which we use as the response variable, the diameter at breast height (DBH) in decimeters (dbh), and the age (age).

For all climate-specific variables, i.e. the accumulated precipitation per tree over its lifespan (prec) and the accumulated temperature per tree over its lifespan (t), and also for the adjusted altitude per location (alt) we use nonlinear effects as well.

Finally, to account for the spatial effect, we include a tensor product spline with the approximated coordinates of each tract, i.e. the longitude and latitude (longlat). We provide more details about the model setup in Appendix Sect. 7.7.

We further split the data into 70% training and 30% test data to evaluate the predictive performance of each method. In total, we have 7,082 observations in the training and 3,035 observations in the test data set, and overall 826 coefficients in the model.

5.2 Results

The results, as shown in Fig. 3, reveal the importance of considering climate variables for understanding the tree heights of Douglas fir. Both increasing accumulated precipitation and increasing accumulated temperature raise the expected tree height at moderate values of the covariates. For more extreme values of accumulated temperature, the effect on tree height is less clear. For high values of accumulated precipitation, the estimated effect starts to decrease.

The estimated effect of age shows an expected functional form, basically following the typical height growth pattern over age, i.e. large height increment in younger ages and levelling off in older ages. The altitude of the tree location does not seem to play an important role.

When comparing the different proposed methods, both similarities and distinct differences are visible. The estimates for the mean effects are similar across all methods; only SIVI deviates in some parameters. For higher values of altitude, the estimated mean effect of SIVI tends to be slightly below that of the other methods. Additionally, the mean of the error variance for SIVI, at about 7.97, is estimated higher than for all other methods, whose estimates lie in the range between 7.59 and 7.61 (see Appendix Sect. 9.3, Table 4 for more information).

The differences between the methods become more apparent when considering the CIs, in particular for the estimated effects of accumulated precipitation, accumulated temperature, and altitude. Here, we face the additional problem that the covariates are correlated. The results show a pattern for correlated effects similar to that discovered in the simulation study: the 95% simultaneous CI bands of MFVI (block) are narrower than those of the other methods. However, in contrast to the simulation, SIVI is also unable to match the CI widths of the Gibbs sampler. SIMFVI and MFVI (full) show results much closer to those of the Gibbs sampler. In regions with only a few observations, SIMFVI tends to have narrower CI bands compared to Gibbs and MFVI (full).

Fig. 3

Estimated effects with 95% simultaneous CIs colored by method

The biggest differences are in the CI bands for altitude. For MFVI (block) and SIVI, the CI bands lie at one point above and at another point below the zero line, whereas for Gibbs, MFVI (full), and SIMFVI, the CI bands cover the zero line across all values of altitude.

Similarly, the spatial effects of the different methods show differences in uncertainty levels (see Fig. 4). To highlight the differences in CI width, we visualize the CI width of each VI method as a share of the width of the Gibbs sampler. Dark blue areas mean the CI is on par with Gibbs, and yellow means the CI width is about 30% of that of Gibbs, which is the lowest measured share at any location.

The biggest differences are in south-west Germany, where MFVI (block) and also SIVI show much narrower CI widths compared to the other methods. Similarly, in northern Germany, the areas of MFVI (block) and SIVI are more lightly shaded. In areas without observations (without grey dots), there appears to be no difference between the methods.

Fig. 4

Width of the 95% simultaneous CI of the two-dimensional spline as a share of the CI width of the Gibbs sampler across different VI methods. 100% (dark blue) stands for the same width as the Gibbs sampler. Tracts with at least one Douglas fir are marked as grey dots

The accuracy of parameter uncertainty in SIMFVI can be improved further by altering some parameters of the algorithm, in particular the number of samples K drawn from the neural net to approximate the ELBO from below. Higher values tighten the lower bound further, but at the expense of computational time. We opt to draw \( K=100 \) samples from the neural net for SIVI and SIMFVI; increasing the number to \( K=300 \) brings the CI width slightly closer to that of Gibbs (see an example of the spatial effect with SIMFVI in Fig. 20 in Appendix Sect. 9.3), but at a significant computational cost.

The improvement from an increase in K is only marginal, as can be seen when evaluating the whole distribution based on the Gibbs samples (see Table 1). SIMFVI with \( K=300 \) is marginally closer to but still slightly worse than MFVI (full). In general, the ALDG confirms the improved approximation of SIMFVI over MFVI (block): there is a substantial gap between the ALDG of these two methods, whereas SIVI shows only minor improvements.

Table 1 Estimated ALDG across VI methods

Finally, we compare the predictive performance of each method. The predictive power is similar across all methods. For the MSE and the predictive coverage on the test data, we do not find significant differences between the methods. The predictive coverage is about 93 to 96% for all methods, just as expected from the nominal level (see Table 5 in Appendix Sect. 9.3 for more details).

6 Conclusion

As access to ever more data resources grows, and with it the interest in fast approximate methods, variational inference has gained considerably in popularity. In our analyses of additive models, variational inference performs well in terms of point estimates and parameter uncertainty, even when making use of the strong mean-field assumption. However, the performance may degrade, and in particular the parameter uncertainty may be underestimated, if the mean-field assumption is placed on critical parameters, such as different coefficient blocks. Blocking might nevertheless be of interest, as treating all coefficients simultaneously requires handling and estimating large matrices, in particular when combining spatial and cluster effects in a model that requires numerous coefficients.

The proposed SIVI and SIMFVI algorithms are capable of using a blocked structure on the coefficients while still giving accurate results on parameter uncertainty. In cases where a Gaussian variational distribution on the coefficients is too restrictive, SIVI and SIMFVI can even outperform MFVI with a full covariance structure, as they allow more flexibility in the coefficients' posterior distribution. Yet, the performance of SIVI seems to deteriorate when dealing with large matrices, e.g. from spatial effects, or a large number of observations. In these cases, the gradient-based approach to estimating the parameters of the covariance matrices appears to be rather inefficient. The SIMFVI algorithm solves this issue and additionally needs less computational time.

There is, however, still much room for improvement in the computational time of SIVI and SIMFVI. More efficient implementations could make both methods considerably faster.

We only investigate the use case of blocked versus fully unstructured covariance matrices for the coefficients. Future research can address more complex scenarios including complex hierarchical models and extensions to generalized additive models.