1 Introduction

The Bayesian paradigm provides a convenient and attractive framework for performing inference in statistical models, allowing for the incorporation of prior knowledge and, therefore, regularization of the effects of interest. However, beyond simple conjugate cases, the posterior distribution resulting from Bayes’ theorem is in general not analytically tractable. The invention of Markov chain Monte Carlo (MCMC) simulation techniques has revolutionized the applicability of Bayesian inference even in very complex statistical models, providing sampling-based numerical access to the posterior. MCMC provides access to the exact posterior even for small samples, including exact uncertainty quantification also for complex functionals of the original model parameters. On the downside, however, MCMC is notoriously slow due to its sequential construction and requires careful monitoring of mixing and convergence towards the (unknown) stationary distribution, often including the adaptive choice of tuning parameters. Hence, there is renewed interest in approximate approaches to Bayesian inference that bypass the need for MCMC sampling at the cost of only approximate access to the posterior.

One such approach that has gained considerably in popularity, especially in machine learning, is variational inference (VI), also called variational Bayes. The basic idea is to find the optimal approximation to the posterior distribution within a pre-specified class of variational distributions by searching for the parameters of the approximating distribution with a deterministic optimization scheme (Ormerod and Wand 2010; Blei et al. 2017). In contrast to stochastic simulation techniques such as MCMC, the direct optimization of an objective function promises much faster inference. However, depending on the complexity of the approximating family chosen for VI, the approximate posterior may not capture all aspects of the true posterior distribution and, in particular, it has been reported that simple VI approaches may considerably underestimate the uncertainty attached to the parameters of interest (Bishop 2006, Ch. 10). This is particularly the case for the simplest form of VI, mean-field VI (MFVI), where the variational family assumes (blocks of) parameters to be mutually independent. This assumption significantly reduces the complexity of the approximation problem and often enables fast optimization steps resembling the structure of Gibbs updates in MCMC. However, the restrictive assumption of posterior independence is often at odds with the true posterior, such that MFVI provides sensible point estimates but may severely underestimate parameter uncertainty.

As a consequence, various approaches beyond simple MFVI have been suggested (as reviewed, for example, in Zhang et al. 2018). One obvious remedy is to combine as many parameters as possible in one block such that one multivariate variational distribution is constructed, thereby mitigating the mean-field assumption (see for example Hui et al. 2019; Luts et al. 2014). However, this comes at the price of determining a fully unstructured covariance matrix for all parameters simultaneously, which requires handling large matrices, especially for a large number of effects. An alternative is the semi-implicit VI (SIVI) approach recently developed by Yin and Zhou (2018). Compared to MFVI, it increases the complexity of the variational distribution, allowing for some parameter dependencies. Firstly, SIVI uses a hierarchical construction of the variational parameters, based on hierarchical VI (Ranganath et al. 2016), to bring back parameter dependencies. Secondly, the mixing distribution on the higher level of the hierarchy may be an implicit distribution, i.e. it is not required to have an analytic probability density function as long as samples can be generated from it (Diggle and Gratton 1984), which allows for highly flexible choices. While this approach brings simulation back into the inferential procedure, the underlying reasoning relies on law-of-large-numbers asymptotics, which are much easier to control and monitor than the distributional convergence of a Markov chain towards its limiting stationary distribution.

In this paper, we focus on semiparametric additive models as a particularly important special case of statistical modelling where Bayesian inference has gained considerable interest and where both Gibbs sampling (Lang and Brezger 2004) and simple MFVI (Luts and Wand 2015; Waldmann and Kneib 2015; Hui et al. 2019) have been developed. More precisely, we

  • review different forms of VI, including MFVI and SIVI, in their general form,

  • develop their specific forms in semiparametric additive models, including an improved MFVI approach in which all regression coefficients associated with the additive components are combined in one block, following ideas developed in Luts and Wand (2015) and Hui et al. (2019), as well as a combination of SIVI and MFVI (SIMFVI) that leads to more robust results and speeds up the optimization compared to the SIVI approach,

  • investigate the performance of the different forms of VI with a specific focus on quantifying uncertainty in simulations to provide guidance on their reliability and applicability, and

  • apply the methods to tree heights of Douglas fir from a large-scale forestry data set.

We find that SIVI and SIMFVI effectively restore parameter uncertainty such that local and simultaneous credible intervals are accurately represented. However, the improved version of MFVI shows comparable performance such that combining all regression parameters in one block and therefore incorporating across effect dependence seems to be the crucial aspect in constructing an appropriate approximating distribution.

The structure of this article is as follows: In Sect. 2, we briefly introduce the necessary background on Bayesian additive regression models. Section 3 describes the methodology of VI and derives the algorithms for the different forms of MFVI and SIVI both in general and in the context of additive models. In Sect. 4, we compare all introduced methods and the Gibbs sampler in a simulation study with a focus on uncertainty quantification. Section 5 describes an application of the presented methods to tree heights of Douglas fir. In the final section, we summarize our results and briefly discuss limitations and potential directions for future research.

2 Bayesian additive models

We consider Bayesian forms of semiparametric additive models for regression data \((y_i, \varvec{x}_i)\), \(i=1,\ldots ,n\) where \(y_i\) denotes the response variable and \(\varvec{x}_{i}\) is a vector of explanatory variables of different type. More specifically, we assume the model structure

$$\begin{aligned} y_i = \sum _{j=1}^{p}f_j(\varvec{x}_{ij})+ \epsilon _i, \end{aligned}$$

where \(\epsilon _i \sim \mathcal {N}(0, \sigma ^2)\) represents the independent Gaussian error term while the effect of the covariates is additively decomposed into p effects \(f_j(\cdot )\) that may represent linear, nonlinear, clustered (random), or spatial effects (among others) in a generic form. Each of the effects is then expanded in \(d_j\) basis functions as

$$\begin{aligned} f_j(\varvec{x}_{ij}) = \sum _{l=1}^{d_j}\gamma _{jl}B^j_l(\varvec{x}_{ij}) \end{aligned}$$

with effect-specific basis functions \( B^j_l(\varvec{x}_{ij})\) and corresponding basis coefficients \(\gamma _{jl}\). In vector–matrix notation, this implies the model

$$\begin{aligned} \varvec{y}&= \varvec{Z}_1\varvec{\gamma }_1 + \ldots + \varvec{Z}_p\varvec{\gamma }_p + \varvec{\epsilon } = \varvec{Z}\varvec{\gamma } + \varvec{\epsilon } \end{aligned}$$
(1)

where \(\varvec{y}\) and \(\varvec{\epsilon }\) are the vectors of responses and error terms, the design matrices of basis function evaluations are denoted as \(\varvec{Z}_j\), and \(\varvec{\gamma }_j\) are the corresponding vectors of basis coefficients. Stacking all design matrices into the matrix \(\varvec{Z}\) and all basis coefficients into the vector \(\varvec{\gamma }\) yields the final representation as a large linear model.

To regularize the estimation of the basis coefficients, we employ multivariate normal priors

$$\begin{aligned} p(\varvec{\gamma }_{j}|\tau _j^2)&\propto \frac{1}{\big (2\pi \tau _j^2\big )^{\frac{\text {rank}(\varvec{K}_j)}{2}}} \text {exp}\left( -\frac{\varvec{\gamma }_j'\varvec{K}_j\varvec{\gamma }_j}{2\tau _j^2}\right) , \end{aligned}$$
(2)

with zero mean and precision matrix \(\varvec{K}_j / \tau _j^2\). The precision matrix is chosen to reflect desirable regularization properties such as smoothness or shrinkage and may contain a non-trivial null space, rendering Equation (2) a partially improper prior specification. The impact of the prior on the posterior is regulated by the prior variance parameter \(\tau _j^2\). In the remainder of this paper, we will employ weakly informative inverse gamma priors \(\tau _j^2 \sim \text {IG}(a_j, b_j) \), with default values of \( a_j = b_j = 0.1\), but other prior distributions are easily conceivable. Similarly, we assign a weakly informative inverse gamma prior to the error variance, \(\sigma ^2 \sim \text {IG}(a_{\sigma ^2}, b_{\sigma ^2}) \), with the same default values. Analytic forms of the likelihood and the priors are shown in Appendix Sect. 7.3.

Each effect type takes a specific form by choosing the basis functions \(B^j_l(\varvec{x}_{ij})\) and the penalty matrix \( \varvec{K}_j\) (see Fahrmeir et al. 2021, for details):

  • For linear effects, the basis functions are the untransformed covariates, \( \varvec{Z}_j = \varvec{x}_{\cdot j} \), where \( \varvec{x}_{\cdot j} \) is the column vector of observations of covariate j, and a flat prior is obtained by setting \(\varvec{K}_j = \varvec{0}\).

  • In the case of clustered “random” effects, the basis functions represent dummy coding for the grouping variables and the penalty matrix equals the identity matrix, i.e. \( \varvec{K}_j = \varvec{I}_j\).

  • For nonlinear effects of continuous covariates, we use Bayesian P-splines (Lang and Brezger 2004) that are based on B-spline basis functions in combination with a penalty matrix based on a kth-order random walk prior, e.g. a second-order random walk defined as \({\gamma }_{jl} = 2{\gamma }_{j,l-1} - {\gamma }_{j,l-2} + u_{jl},\) with Gaussian errors \( u_{jl} \sim \mathcal {N}(0,\tau _j^2)\) and flat priors for \(\gamma _{j1}\) and \(\gamma _{j2}\). In this way, deviations between subsequent coefficients are penalized, leading to a smoother functional form. The penalty matrix can then be constructed from a difference matrix \( \varvec{D}_j \) such that \( \varvec{K}_j = \varvec{D}_j'\varvec{D}_j \).

  • The concept of Bayesian P-splines can be extended to bivariate tensor product P-splines for fitting spatial effects or interaction surfaces \(f_j(\varvec{x}_{j1}, \varvec{x}_{j2}) \). This is achieved by combining the two univariate spline basis matrices \( \varvec{Z}_{j1} \) and \( \varvec{Z}_{j2} \) in terms of all \( d_{j1} \cdot d_{j2} \) pairwise interactions. The penalty matrix is constructed by combining the two univariate spline penalties \( \varvec{K}_{j1} \) and \( \varvec{K}_{j2} \) to \(\varvec{K}_{j} = \varvec{K}_{j1} \otimes \varvec{I}_{d_{j2}} + \varvec{I}_{d_{j1}}\otimes \varvec{K}_{j2}\) such that smoothness is enforced in both covariate directions, see also Appendix Sect. 7.1 and the sketch below.
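To make these penalty constructions concrete, the following numpy sketch builds a second-order difference penalty \( \varvec{K}_j = \varvec{D}_j'\varvec{D}_j \) and the corresponding Kronecker-sum penalty of a bivariate tensor product spline; the basis dimensions are illustrative.

```python
import numpy as np

d1, d2 = 8, 6                          # illustrative basis dimensions
D1 = np.diff(np.eye(d1), n=2, axis=0)  # second-order difference matrix D_j
K1 = D1.T @ D1                         # univariate P-spline penalty K_j = D_j'D_j
D2 = np.diff(np.eye(d2), n=2, axis=0)
K2 = D2.T @ D2
# Kronecker-sum penalty enforcing smoothness in both covariate directions
K = np.kron(K1, np.eye(d2)) + np.kron(np.eye(d1), K2)
```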

For univariate and bivariate effects as described here, two points should be considered. Firstly, the penalty matrix \( \varvec{K}_j \) is rank deficient and therefore the prior is improper. However, it can be shown that the resulting posterior is still proper (see Appendix Sect. 7.4). Secondly, further restrictions need to be imposed to ensure the identifiability of the model. We impose a centering constraint via the design matrix (see Appendix Sect. 7.2 for more details).

3 Variational inference in additive models

Variational inference (VI), as used in the Bayesian framework, casts the integration problem associated with obtaining the posterior distribution into an optimization problem. During the optimization, VI searches among a set of candidate distributions for the one that approximates the posterior distribution best. If the set of candidate distributions approaches the complexity of the true posterior, VI promises to be computationally faster than MCMC while delivering results of comparable quality. For instance, You et al. (2014) and Wang and Blei (2019) showed consistency of the VI approach in additive models. However, the procedure requires careful choices which determine the quality of the approximation:

  • The variational family \(\mathcal {Q}\), i.e. the set of candidate distributions; a misspecification here directly limits the quality of the estimated posterior.

  • The measure determining the quality of an element of the variational family relative to the exact posterior distribution. The classical divergence measure is the Kullback–Leibler divergence (KL divergence), but other, more general measures as described in Zhang et al. (2018) are possible.

  • The algorithm for finding the best approximating variational distribution, i.e. the best combination of variational parameters \(\varvec{\psi }\), by optimizing the divergence measure. Again, Zhang et al. (2018) discuss different aspects including algorithms and strategies for variance reduction in the context of stochastic VI.

General overviews of variational inference are given in Bishop (2006, Ch. 10), Ormerod and Wand (2010) and Blei et al. (2017). In this paper, we only address the first point and describe possible extensions to the variational family \(\mathcal {Q}\).

In the following, we introduce four different variational families used in this article to approximate the posterior distribution arising in Bayesian additive models. Two of the approaches presented are based on mean field approximations (see Sects. 3.2.1 and 3.2.2) while the remaining families proposed are based on the idea of semi-implicit variational inference (SIVI, Yin and Zhou 2018, see Sects. 3.2.3 and 3.2.4).

We denote the vector of model parameters as \(\varvec{\theta }\) and its posterior density with \(p(\varvec{\theta }| \varvec{y})\). The elements of the variational family \(\mathcal {Q}\), i.e. the variational distributions, are denoted as \(q_{\varvec{\psi }}\) where \(\varvec{\psi }\) is the vector of variational parameters. The density of the variational distribution is denoted as \(q_{\varvec{\psi }}(\varvec{\theta })\).

To measure the deviation between the variational distribution \(q_{\varvec{\psi }}\) and the posterior distribution, the Kullback–Leibler (KL) divergence,

$$\begin{aligned} {\text {KL}}(q_{\varvec{\psi }}(\varvec{\theta })||p(\varvec{\theta }|{\textbf {y}})) = \mathbb {E}_{\varvec{\theta }\sim q_{\varvec{\psi }}}[\log \,q_{\varvec{\psi }}(\varvec{\theta })] - \mathbb {E}_{\varvec{\theta }\sim q_{\varvec{\psi }}}[\log \,p(\varvec{\theta }, {\textbf {y}})] + \log \,p({\textbf {y}}), \end{aligned}$$

is used (Jordan et al. 1999; Ormerod and Wand 2010). The KL divergence decreases with increasing similarity of the two distributions and is zero for two identical distributions. Hence, we want to find \(\varvec{\psi }^\star \) minimizing the KL divergence. Instead of working directly with the KL divergence, the minimization problem is reformulated as an equivalent maximization problem. Precisely, \(\varvec{\psi }^\star \) is determined by maximizing the evidence lower bound (ELBO),

$$\begin{aligned} \mathcal {L} (\varvec{\psi })&= \mathbb {E}_{\varvec{\theta }\sim q_{\varvec{\psi }}}[\log \,p(\varvec{\theta }, {\textbf {y}})] - \mathbb {E}_{\varvec{\theta }\sim q_{\varvec{\psi }}}[\log \,q_{\varvec{\psi }}(\varvec{\theta })], \end{aligned}$$

not containing the intractable marginal likelihood or model evidence \(p({\textbf {y}})\), which does not depend on \(\varvec{\psi }\). The ELBO serves as a lower bound to the model evidence.
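To make the two expectations in the ELBO explicit, the following sketch estimates the ELBO by simple Monte Carlo for a toy conjugate model (\(y_i \sim \mathcal {N}(\theta , 1)\) with prior \(\theta \sim \mathcal {N}(0, 1)\)) and a Gaussian variational distribution; the model and all names are illustrative choices of ours.

```python
import torch

def elbo_estimate(y, m, log_s, n_samples=1000):
    # Gaussian variational distribution q = N(m, s^2)
    q = torch.distributions.Normal(m, log_s.exp())
    theta = q.rsample((n_samples,))  # reparametrized draws allow gradients
    log_lik = torch.distributions.Normal(theta.unsqueeze(-1), 1.0).log_prob(y).sum(-1)
    log_prior = torch.distributions.Normal(0.0, 1.0).log_prob(theta)
    # ELBO = E_q[log p(theta, y)] - E_q[log q(theta)]
    return (log_lik + log_prior - q.log_prob(theta)).mean()

y = torch.randn(20) + 1.5  # synthetic data
m = torch.tensor(0.0, requires_grad=True)
log_s = torch.tensor(0.0, requires_grad=True)
elbo = elbo_estimate(y, m, log_s)  # maximize this, e.g. with torch.optim
```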

3.1 Mean-field and semi-implicit VI

3.1.1 Mean-field VI

Mean-field variational inference (MFVI, Parisi 1988; Saul and Jordan 1998) is based on a strong simplification: the posterior distribution is assumed to be well approximated by independent parameter blocks, allowing the variational density to be expressed as a product of the independent densities of the parameter blocks. The advantage of this simplification lies in eased computation (Wainwright and Jordan 2008, pp. 127–147) and the resulting speed gains. An iterative optimization scheme can be constructed by updating the variational parameters associated with one sub-vector at a time such that each update maximizes the ELBO. For example, the coordinate ascent variational inference algorithm (CAVI, Bishop 2006, Ch. 10) works in this way.

Suppose the vector of model parameters is divided into p sub-vectors such that \(\varvec{\theta }= (\varvec{\theta }_1', \dots , \varvec{\theta }_p')'\). Using the MFVI approach, the variational density can be expressed as \( q_{\varvec{\psi }}(\varvec{\theta }) = \prod _{i=1}^p q_{\varvec{\psi }_i}(\varvec{\theta }_i), \) where \(\varvec{\psi }_i\) denotes the variational parameters associated with the variational distribution of the i-th sub-vector of \(\varvec{\theta }\). The variational density of the i-th sub-vector maximizing the ELBO is

$$\begin{aligned} q^\star (\varvec{\theta }_i) \propto \text {exp}\left\{ \mathbb {E}_{\varvec{\theta }_{-i} \sim q_{\varvec{\psi }_{-i}}}\left[ \log (p(\varvec{y} | \varvec{\theta }) p(\varvec{\theta })) \right] \right\} , \end{aligned}$$
(3)

where \(\varvec{\theta }_{-i}\) denotes the parameter vector \(\varvec{\theta }\) without the i-th sub-vector and \(q_{\varvec{\psi }_{-i}}\) the corresponding variational distribution with its associated variational parameters (Bishop 2006, Ch. 10). When selecting \(q_{\varvec{\psi }_{-i}}\) suitably and exploiting conditional conjugacy, a closed-form solution for \(q^\star \) can be constructed by updating the variational parameter \(\varvec{\psi }_i\), similar to the parameters describing the sampling distribution in a Gibbs update step. CAVI then repeatedly iterates over i to update \(\varvec{\psi }_i\) until convergence of the ELBO.

3.1.2 Semi-implicit VI

SIVI (Yin and Zhou 2018) builds upon the idea of hierarchical variational inference (HVI) proposed by Ranganath et al. (2016) to reintroduce dependencies between the parameter blocks that are assumed independent in MFVI. To illustrate the concept of HVI, suppose we have three parameter blocks \(\varvec{\theta }= (\varvec{\theta }_1, \varvec{\theta }_2, \varvec{\theta }_3)\) and assume a variational hyper-distribution \(q_{\varvec{\phi }}\) for the variational parameters \(\varvec{\psi }_2\) and \(\varvec{\psi }_3\). The variational density of \(\varvec{\theta }\) with the variational parameters \(\varvec{\psi }_1\) and \(\varvec{\phi }\) can then be expressed as

$$\begin{aligned} q_{\varvec{\psi }_1, \varvec{\phi }}(\varvec{\theta })&= q_{\varvec{\psi }_1}(\varvec{\theta }_1) \int q_{\varvec{\psi }_2}(\varvec{\theta }_2) q_{\varvec{\psi }_3}(\varvec{\theta }_3) q_{\varvec{\phi }}(\varvec{\psi }_2, \varvec{\psi }_3) \text {d}\varvec{\psi }_2 \text {d}\varvec{\psi }_3. \end{aligned}$$
(4)

Thus, the dependency between \(\varvec{\theta }_2\) and \(\varvec{\theta }_3\) in the posterior can be restored in the variational distribution via the dependency between \(\varvec{\psi }_2\) and \(\varvec{\psi }_3\) introduced through \(q_{\varvec{\phi }}\). The expansion of the variational family comes at the expense of an increased computational burden. Hence, there is a trade-off between choosing MFVI as the faster optimization method and HVI, which gives better approximations to the posterior in more complex settings but slows down the computation.

SIVI takes the idea of HVI a step further by allowing \(q_{\varvec{\phi }}\) to be an implicit distribution, meaning a distribution whose density cannot be evaluated but from which we can sample. This renders Equation (4) analytically unsolvable, so we cannot access the ELBO directly. Instead, the authors suggest constructing a lower bound to the ELBO. More precisely, the lower bound \(\widetilde{\mathcal {L}}_0\) is constructed as

$$\begin{aligned} \mathcal {L} (\varvec{\psi }_1, \varvec{\phi })&= -\mathbb {E}_{(\varvec{\psi }_2, \varvec{\psi }_3) \sim q_{\varvec{\phi }}}{\text {KL}}(q_{\varvec{\psi }}(\varvec{\theta })||p(\varvec{\theta }|{\textbf {y}})) + \log \,p({\textbf {y}})\\&\ge -{\text {KL}}(\mathbb {E}_{(\varvec{\psi }_2, \varvec{\psi }_3) \sim q_{\varvec{\phi }}}q_{\varvec{\psi }}(\varvec{\theta })||p(\varvec{\theta }|{\textbf {y}})) + \log \,p({\textbf {y}}) = \widetilde{\mathcal {L}}_0 (\varvec{\psi }_1, \varvec{\phi }) \end{aligned}$$

using Jensen’s inequality with the observation that the KL-divergence can be viewed as a convex functional (a proof is provided in Yin and Zhou 2018, Appendix A).

\(\widetilde{\mathcal {L}}_0\) can be used as a target to optimize the variational parameters. In practice, the implicit distribution is implied by the transformation of some noise \(\varvec{\varepsilon } \sim \mathcal {D}\) (e.g. \(\mathcal {D}\) is the k-dimensional standard normal distribution) with a deep neural net such that \((\varvec{\psi }_2', \varvec{\psi }_3')' = (T_{\varvec{\phi }}(\varvec{\varepsilon })_1', T_{\varvec{\phi }}(\varvec{\varepsilon })_2')' = T_{\varvec{\phi }}(\varvec{\varepsilon })\). However, Yin and Zhou (2018) show in Proposition 1 that optimizing \(\widetilde{\mathcal {L}}_0\) without early stopping can lead to a degenerate distribution for \((\varvec{\psi }_2, \varvec{\psi }_3)\), i.e. a distribution with a single point mass. To avoid this, the authors suggest adding a regularizing term to \(\widetilde{\mathcal {L}}_0\), yielding

$$\begin{aligned} \widetilde{\mathcal {L}}(\varvec{\psi }_1, \varvec{\phi })&= \mathbb {E}_{\varvec{\varepsilon } \sim \mathcal {D}}\, \mathbb {E}_{\varvec{\theta }\sim q_{(\varvec{\psi }_1, T_{\varvec{\phi }}(\varvec{\varepsilon }))}}\, \mathbb {E}_{\varvec{\varepsilon }^{(1)}, \dots , \varvec{\varepsilon }^{(K)} \sim \mathcal {D}} \Bigg [ \log p(\varvec{\theta }, {\textbf {y}}) - \log q_{\varvec{\psi }_1}(\varvec{\theta }_1) \nonumber \\&\quad -\log \prod _{i=2}^{3}\bigg ( \frac{1}{K+1} \Big ( q_{T_{\varvec{\phi }}(\varvec{\varepsilon })_i}(\varvec{\theta }_i|T_{\varvec{\phi }}(\varvec{\varepsilon })_i) + \sum _{k=1}^{K} q_{T_{\varvec{\phi }}(\varvec{\varepsilon }^{(k)})_i}(\varvec{\theta }_i|T_{\varvec{\phi }}(\varvec{\varepsilon }^{(k)})_i) \Big ) \bigg ) \Bigg ] \end{aligned}$$
(5)

as the target for optimization, to which we refer from here on as the lower bound ELBO (lbELBO).

Yin and Zhou (2018) show that with increasing K the lbELBO approaches the ELBO reaching equality for \(K \rightarrow \infty \). The expectations in the lbELBO can be estimated via stochastic approximation. Note that the conditional densities \( q_{T_{\varvec{\phi }}(\varvec{\varepsilon })_i}(\varvec{\theta }_i|T_{\varvec{\phi }}(\varvec{\varepsilon })_i) \) can also include non-hierarchical variational parameters, e.g. \( q_{T_{\varvec{\phi }}(\varvec{\varepsilon })_i, \varvec{\psi }_{i,2}}(\varvec{\theta }_i|T_{\varvec{\phi }}(\varvec{\varepsilon })_i) \) with additional fixed parameters \( \varvec{\psi }_{i,2} \). Finally, updates to the variational parameters are based on the respective gradients. The gradients are available via reverse-mode automatic differentiation exploiting the reparametrization trick (Kingma and Welling 2014). In particular, the updates at iteration \(\ell \) are given by

$$\begin{aligned} \varvec{\phi }^{(\ell )}&= \varvec{\phi }^{(\ell - 1)} + \rho _1^{(\ell )} \,\nabla _{\varvec{\phi }}\,\widetilde{\mathcal {L}}(\varvec{\psi }_1, \varvec{\phi }), \\ \varvec{\psi _1}^{(\ell )}&= \varvec{\psi _1}^{(\ell - 1)} + \rho _2^{(\ell )} \,\,\nabla _{\varvec{\psi _1}}\,\widetilde{\mathcal {L}}(\varvec{\psi }_1, \varvec{\phi }), \end{aligned}$$

with exponentially decaying learning rates \(\rho _1^{(\ell )}, \rho _2^{(\ell )}\). Using decaying learning rates improved numerical stability and gave better results overall.

The flexibility of the variational family in SIVI is only limited in two ways: First, the implicit variational mixing distribution must be reparameterizable, that is, it must be possible to sample from it by transforming an auxiliary variable \( \varvec{\epsilon } \) through a differentiable transformation \(T(\cdot )\), e.g. \( \varvec{\psi } = T_{\varvec{\phi }}( \varvec{\epsilon }) \) with \( \varvec{\epsilon } \sim \mathcal {N}(\varvec{0},\varvec{I})\). Second, the conditional variational distribution of the coefficients must be analytic and reparameterizable or, as demonstrated in Yin and Zhou (2018), the ELBO must be analytic.
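As an illustration of such a reparameterizable implicit distribution, the following pytorch sketch pushes standard normal noise through a small neural network \(T_{\varvec{\phi }}\); the architecture (two hidden ReLU layers) and all dimensions are our assumptions for illustration, not the exact setup used later.

```python
import torch
import torch.nn as nn

class TPhi(nn.Module):
    """Implicit mixing distribution: psi = T_phi(eps) with eps ~ N(0, I)."""
    def __init__(self, noise_dim, out_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, n_draws):
        eps = torch.randn(n_draws, self.net[0].in_features)  # eps ~ N(0, I)
        return self.net(eps)                                  # draws of psi

t_phi = TPhi(noise_dim=10, out_dim=30)  # e.g. 30 stacked variational means
optimizer = torch.optim.Adam(t_phi.parameters(), lr=1e-3)
psi_draws = t_phi(n_draws=5)            # 5 draws from the implicit distribution
```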

3.2 Mean-field and semi-implicit VI for additive models

In this section, we discuss the concrete implementations of two MFVI approaches and two SIVI approaches for the additive model.

3.2.1 Mean-field VI with block-diagonal covariance matrix

In the additive model, the mean-field factorization can be blocked as \(\varvec{\theta }= (\varvec{\gamma }_1',\ldots ,\varvec{\gamma }_p', \tau _1^2, \dots , \tau _p^2, \sigma ^2)'\) with the variational density given by the factors

$$\begin{aligned} q_{\varvec{\psi }}(\varvec{\theta }) =q_{\varvec{\psi }_{\sigma ^2}}(\sigma ^2) \prod _{j=1}^{p} q_{\varvec{\psi }_{\varvec{\gamma }_j}}(\varvec{\gamma }_{j}) \, q_{\varvec{\psi }_{\tau ^2_j}}(\tau ^2_j), \quad q_{\varvec{\psi }} \in \mathcal {Q}_{\text {MFb}}, \end{aligned}$$

(Waldmann and Kneib 2015). Using Equation (3) and exploiting conditional conjugacy results in each \(q_{\varvec{\psi }_{\varvec{\gamma }_j}}\) being multivariate Gaussian and the remaining distributions being inverse gamma, each parametrized by the parameter vector in its index.

This formulation allows the construction of iterative updates to the variational parameters as follows: For \(\varvec{\psi }_{\varvec{\gamma }_j}\) as re-parametrization of the mean vector and covariance matrix, i.e. \(\varvec{\psi }_{\varvec{\gamma }_j} = (\varvec{\mu }_j, \varvec{\Sigma }_j)\), in the variational distribution of \(\varvec{\gamma }_j\), the updates are

$$\begin{aligned} \varvec{\Sigma }_j&= \left( \mathbb {E}_{q}\!\left[ \tfrac{1}{\sigma ^2}\right] \varvec{Z}_j'\varvec{Z}_j + \mathbb {E}_{q}\!\left[ \tfrac{1}{\tau _j^2}\right] \varvec{K}_j \right) ^{-1}, \\ \varvec{\mu }_j&= \mathbb {E}_{q}\!\left[ \tfrac{1}{\sigma ^2}\right] \varvec{\Sigma }_j \varvec{Z}_j'\Big ( \varvec{y} - \sum _{k \ne j} \varvec{Z}_k \varvec{\mu }_k \Big ). \end{aligned}$$

The variational distribution \( q^*(\sigma ^2)\) of the error variance is \(\text {IG}(\nu _{a_{\sigma ^2}}, \nu _{b_{\sigma ^2}})\) and the updates for the variational parameters are

$$\begin{aligned} \nu _{a_{\sigma ^2}}&= a_{\sigma ^2} + \frac{n}{2}, \end{aligned}$$
(6)
$$\begin{aligned} \nu _{b_{\sigma ^2}}&= b_{\sigma ^2} + \frac{1}{2} \left( (\varvec{y} - \varvec{Z}\varvec{\mu })'(\varvec{y} - \varvec{Z}\varvec{\mu }) + \text {tr}\left( \varvec{Z}'\varvec{Z}\varvec{\Sigma }\right) \right) , \end{aligned}$$
(7)

where \( \varvec{Z} = (\varvec{Z}_1, \ldots , \varvec{Z}_p) \) and \(\varvec{\mu } =(\varvec{\mu }_1', \ldots , \varvec{\mu }_p')'\) are the stacked design matrices and mean vectors, respectively. For the covariance matrix \( \varvec{\Sigma } \), we can collect the component-wise covariance matrices into one block-diagonal covariance matrix, i.e. \(\varvec{\Sigma } = \text {blockdiag}(\varvec{\Sigma }_1, \dots , \varvec{\Sigma }_p) \). Therefore, we call this approach, MFVI with a block-diagonal covariance matrix, MFVI (block).

The variational distributions \( q^*(\tau _j^2) \) of the smoothing parameters are \(\text {IG}(\nu _{a_j}, \nu _{b_j}), \;\forall \; j=1, \dots , p\) and the updates for the variational parameters are

$$\begin{aligned} \nu _{a_j}&= a_j + \frac{\text {rank}(\varvec{K}_j)}{2} , \end{aligned}$$
(8)
$$\begin{aligned} \nu _{b_j}&= b_j + \frac{1}{2} \left( \text {tr}\left( \varvec{K}_j\varvec{\Sigma }_j\right) + \varvec{\mu }_j'\varvec{K}_j\varvec{\mu }_j\right) . \end{aligned}$$
(9)

The full derivations for the variational distribution of the coefficients, the error variance and the smoothing parameters are shown in Appendix Sect. 7.4.
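The updates above translate directly into code. The following numpy sketch iterates the CAVI updates of MFVI (block) for given design matrices Z_j and penalty matrices K_j; the initialization, the use of the default prior parameters a = b = 0.1 for all components, and a fixed iteration count in place of monitoring the ELBO are simplifications for illustration.

```python
import numpy as np

def cavi_block(y, Z_list, K_list, a=0.1, b=0.1, n_iter=100):
    n, p = len(y), len(Z_list)
    mu = [np.zeros(Z.shape[1]) for Z in Z_list]
    Sigma = [np.eye(Z.shape[1]) for Z in Z_list]
    e_inv_s2, e_inv_t2 = 1.0, np.ones(p)  # E[1/sigma^2], E[1/tau_j^2]
    for _ in range(n_iter):
        for j, (Z, K) in enumerate(zip(Z_list, K_list)):
            # update q(gamma_j): multivariate Gaussian
            Sigma[j] = np.linalg.inv(e_inv_s2 * Z.T @ Z + e_inv_t2[j] * K)
            resid = y - sum(Zk @ mu[k] for k, Zk in enumerate(Z_list) if k != j)
            mu[j] = e_inv_s2 * Sigma[j] @ Z.T @ resid
            # update q(tau_j^2): inverse gamma, Equations (8) and (9)
            nu_a = a + np.linalg.matrix_rank(K) / 2
            nu_b = b + 0.5 * (np.trace(K @ Sigma[j]) + mu[j] @ K @ mu[j])
            e_inv_t2[j] = nu_a / nu_b  # E[1/tau_j^2] under IG(nu_a, nu_b)
        # update q(sigma^2): inverse gamma, Equations (6) and (7)
        fit = sum(Z @ m for Z, m in zip(Z_list, mu))
        quad = sum(np.trace(Z.T @ Z @ S) for Z, S in zip(Z_list, Sigma))
        e_inv_s2 = (a + n / 2) / (b + 0.5 * ((y - fit) @ (y - fit) + quad))
    return mu, Sigma
```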

3.2.2 Mean-field VI with full covariance matrix

In the special case of a multivariate Gaussian distribution for the coefficients, we can use a single multivariate Gaussian distribution for all coefficients such that the variational density factorizes as

$$\begin{aligned} q_{\varvec{\psi }}(\varvec{\theta }) =q_{\varvec{\psi }_{\sigma ^2}}(\sigma ^2)\, q_{\varvec{\psi }_{\varvec{\gamma }}}(\varvec{\gamma }) \prod _{j=1}^{p} q_{\varvec{\psi }_{\tau ^2_j}}(\tau ^2_j), \quad q_{\varvec{\psi }} \in \mathcal {Q}_{\text {MFf}}, \end{aligned}$$

with multivariate Gaussian distribution \(q_{\varvec{\psi }_{\varvec{\gamma }}}\). The iterative updates to the variational parameters are

$$\begin{aligned} \varvec{\Sigma }&= \left( \mathbb {E}_{q}\!\left[ \tfrac{1}{\sigma ^2}\right] \varvec{Z}'\varvec{Z} + \text {blockdiag}\left( \mathbb {E}_{q}\!\left[ \tfrac{1}{\tau _1^2}\right] \varvec{K}_1, \dots , \mathbb {E}_{q}\!\left[ \tfrac{1}{\tau _p^2}\right] \varvec{K}_p \right) \right) ^{-1}, \\ \varvec{\mu }&= \mathbb {E}_{q}\!\left[ \tfrac{1}{\sigma ^2}\right] \varvec{\Sigma }\, \varvec{Z}'\varvec{y}. \end{aligned}$$

As the covariance matrix \( \varvec{\Sigma } \) is full and unstructured, we call this approach, MFVI with a full covariance matrix, MFVI (full). The updates for the error variance and the smoothing parameters are the same as in MFVI (block).

In MFVI (full), the mean-field assumption still plays a role but is not very restrictive. First, the assumption of independence between the error variance and the coefficients is only a mild one. For instance, in the case of a diminishing penalty term close to zero, we can appeal to the properties of ordinary least squares, where all columns of the design matrix are orthogonal to the residuals. For larger influences of the penalty term, however, this assumption is violated. Second, the assumption that the smoothing parameters of the components are independent of each other and, conditionally on the coefficients, of the error variance imposes only a mild restriction, as it only concerns the hyper-parameters. However, MFVI (full) is computationally very demanding for large variational covariance matrices \( \varvec{\Sigma }\), because inverting the \(m \times m\) matrix \(\varvec{\Sigma }\) requires \(O(m^3)\) operations. Furthermore, MFVI (full) as discussed here for conditionally conjugate models is not very flexible, as it is limited to a multivariate Gaussian variational distribution for the coefficients of all additive components. For smaller data sets, however, a multivariate Gaussian variational distribution may not capture heavier tails.
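For contrast with the blocked updates, a minimal sketch of the MFVI (full) coefficient update: scipy's block_diag assembles the combined penalty, and the inversion of the full m × m matrix is the O(m^3) bottleneck discussed above.

```python
import numpy as np
from scipy.linalg import block_diag

def update_gamma_full(y, Z, K_list, e_inv_s2, e_inv_t2):
    # combined penalty: E[1/tau_j^2] K_j stacked block-diagonally
    P = block_diag(*[e * K for e, K in zip(e_inv_t2, K_list)])
    Sigma = np.linalg.inv(e_inv_s2 * Z.T @ Z + P)  # full, unstructured m x m
    mu = e_inv_s2 * Sigma @ Z.T @ y
    return mu, Sigma
```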

3.2.3 Semi-implicit VI

Using SIVI has the advantage of retaining a blocked covariance matrix while increasing the flexibility of the variational distribution for the coefficients and thus restoring coefficient dependencies. Additionally, more complex posterior distributions can be captured. The variational density is

$$\begin{aligned} q_{\varvec{\phi }, \varvec{\nu }}(\varvec{\theta }) = \,&q_{\varvec{\psi }_{\sigma ^2}}(\sigma ^2) \prod _{j=1}^{p} q_{\varvec{\psi }_{\tau ^2_j}}(\tau ^2_j) \nonumber \\&\int \left( \prod _{j=1}^{p} q_{\varvec{\psi }_{\varvec{\gamma }_j}}(\varvec{\gamma }_{j}|\varvec{\psi }_{\varvec{\gamma }_j})\right) q_{\varvec{\phi }}(\varvec{\psi }_{\varvec{\gamma }}) \text {d}\varvec{\psi }_{\varvec{\gamma }}, \quad q_{\varvec{\phi }, \varvec{\nu }} \in \mathcal {Q}_{\text {HVM}}, \end{aligned}$$
(10)

with \( \varvec{\nu } = (\varvec{\psi }_{\sigma ^2}, \varvec{\psi }_{\tau ^2_1}, \dots ,\varvec{\psi }_{\tau ^2_p}) \). In this way, the variational mean-field parameters for the coefficients, \(\varvec{\psi }_{\varvec{\gamma }}\), are marginalized out. The p coefficient blocks are designed to be independent conditional on \( \varvec{\psi }_{\varvec{\gamma }} \); marginally, however, dependencies between the blocks can be captured.

In line with the variational distributions of MFVI, we choose inverse gamma distributions, \(\sigma ^2 \sim \text {IG}(\nu _{a_{\sigma ^2}}, \nu _{b_{\sigma ^2}})\) with \( \varvec{\psi }_{\sigma ^2} = (\nu _{a_{\sigma ^2}}, \nu _{b_{\sigma ^2}}) \) and \( \tau ^2_j \sim \text {IG}(\nu _{a_j}, \nu _{b_j})\) with \( \varvec{\psi }_{\tau ^2_j} = (\nu _{a_j}, \nu _{b_j}) \) for each j. For the conditional variational distribution, we use a multivariate Gaussian distribution, i.e. \( \varvec{\gamma }_j|\varvec{\psi }_{\varvec{\gamma }_j} \sim \mathcal {N}(\varvec{\mu }_j, \varvec{\Sigma }_{\varvec{\xi }_j}) \) with \( \varvec{\psi }_{\varvec{\gamma }_j} = (\varvec{\mu }_j, \varvec{\Sigma }_{\varvec{\xi }_j}) \). As the optimization was numerically unstable when conditioning on both \( \varvec{\mu }_j \) and \( \varvec{\Sigma }_{\varvec{\xi }_j} \) (conditioning on the variational parameters of the smoothing parameters and the error variance also led to unreliable results), we apply the hierarchical expansion only to the variational parameters \( \varvec{\mu } = (\varvec{\mu }_1',\dots , \varvec{\mu }_p')' \), i.e. \( \varvec{\gamma }_j|\varvec{\mu }_j \sim \mathcal {N}(\varvec{\mu }_j, \varvec{\Sigma }_{\varvec{\xi }_j})\).

The variational parameters of the coefficients are generated as \(\varvec{\mu } = T_{\varvec{\phi }}(\varvec{\epsilon })\) with \( \varvec{\epsilon } \sim \mathcal {N}(\varvec{0}, \varvec{I}) \), where \( T_{\varvec{\phi }}\) transforms the noise \( \varvec{\epsilon } \) through a deep neural network with weight and bias parameters \( \varvec{\phi } \).

The covariance matrix \( \varvec{\Sigma }_{\varvec{\xi }_j} \) is parameterized by \( \varvec{\xi }_j \), the vectorized upper triangular Cholesky factor used to build the covariance matrix, i.e. \( \varvec{\Sigma }_{\varvec{\xi }_j} = g_U(\varvec{\xi }_j)'g_U(\varvec{\xi }_j) \), where \( g_U \) forms an upper triangular Cholesky factor from its argument vector.
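A small pytorch sketch of this parametrization is given below; exponentiating the diagonal of the factor is our illustrative choice to keep it positive under unconstrained gradient updates.

```python
import torch

def g_U(xi, d):
    # fill the upper triangle of a d x d matrix with the entries of xi
    rows, cols = torch.triu_indices(d, d)
    U = torch.zeros(d, d, dtype=xi.dtype).index_put((rows, cols), xi)
    # replace the diagonal by its exponential to keep it positive
    return U - torch.diag(U.diagonal()) + torch.diag(U.diagonal().exp())

xi = torch.randn(6, requires_grad=True)  # d(d+1)/2 free entries for d = 3
U = g_U(xi, d=3)
Sigma = U.T @ U                          # Sigma_xi = g_U(xi)' g_U(xi)
```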

For more details about the structure of SIVI, we provide an illustration in Fig. 6 in Appendix Sect. 9.1 and a pseudo-algorithm in Algorithm 2 in Appendix Sect. 9.

The updates are determined using a gradient-based approach. For the variational hyper-parameters we use the Adam optimizer (Kingma and Ba 2015), and for the parameters \( \varvec{\xi } \) and \( \varvec{\nu } \) we use a stochastic gradient descent optimizer. To estimate the gradients, we rely on the lbELBO for additive models given by

$$\begin{aligned} \widetilde{\mathcal {L}}(\varvec{\phi }, \varvec{\xi }, \varvec{\nu })&= \mathbb {E}_{\varvec{\epsilon } \sim \mathcal {D}}\, \mathbb {E}_{\varvec{\epsilon }^{(1)}, \dots , \varvec{\epsilon }^{(K)} \sim \mathcal {D}}\, \frac{1}{S}\sum _{s=1}^{S} \Bigg [ \mathbb {E}_{(\sigma ^2, \tau ^2_1, \dots , \tau ^2_p) \sim q_{\varvec{\nu }}}\big [ \log \, p(\varvec{\gamma }_{.,s}, \tau ^2_1, \dots , \tau ^2_p, \sigma ^2, \varvec{y}) - \log \, q_{\varvec{\nu }}(\sigma ^2, \tau ^2_1, \dots , \tau ^2_p) \big ] \nonumber \\&\quad - \sum _{j=1}^{p} \log \frac{1}{K+1} \bigg ( q_{T_{\varvec{\phi }}(\varvec{\epsilon })_j, \varvec{\xi }_j}(\varvec{\gamma }_{j,s}|T_{\varvec{\phi }}(\varvec{\epsilon })_j) + \sum _{k=1}^{K} q_{T_{\varvec{\phi }}(\varvec{\epsilon }^{(k)})_j, \varvec{\xi }_j}(\varvec{\gamma }_{j,s}|T_{\varvec{\phi }}(\varvec{\epsilon }^{(k)})_j) \bigg ) \Bigg ] \end{aligned}$$
(11)

with \( \varvec{\gamma }_{.,s} \) the s-th sample of the stacked coefficient vector. All parts of Formula (11) that involve expectations with respect to \( \sigma ^2 \) or the \( \tau ^2_j \) are available in analytic form. Hence, we can use the same analytic derivations as for MFVI. The expectation with respect to the parameters \( \varvec{\gamma }_1, \dots , \varvec{\gamma }_p \) is not tractable and needs to be approximated. We use stochastic approximation by taking S samples \( \varvec{\gamma }_{j,s}\) and averaging over them.
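For numerical stability, the logarithm of the (K+1)-component mixture density appearing in the lbELBO is best evaluated with logsumexp, as in the following pytorch sketch; shapes and names are illustrative.

```python
import math
import torch
from torch.distributions import MultivariateNormal

def log_mixture_density(gamma, mus, Sigma):
    # gamma: (d,) coefficient draw; mus: (K + 1, d) means mu, mu^(1), ..., mu^(K);
    # Sigma: (d, d) conditional covariance shared by all components
    comp = MultivariateNormal(mus, covariance_matrix=Sigma)
    log_probs = comp.log_prob(gamma)  # (K + 1,) component log densities
    return torch.logsumexp(log_probs, dim=0) - math.log(mus.shape[0])

d, K = 4, 100
print(log_mixture_density(torch.randn(d), torch.randn(K + 1, d), torch.eye(d)))
```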

The choices of the variational distributions match those of MFVI for reasons of comparability, and they enable an analytic solution for MFVI. However, various MFVI algorithms have been developed to go beyond conditionally conjugate models. In the case of SIVI, an arbitrary variational distribution for the error variance and the smoothing parameter(s) can be used. Additionally, the conditional variational distribution is not restricted to be multivariate Gaussian; other symmetric as well as asymmetric distributions can be considered.

3.2.4 Semi-implicit mean field VI

We propose an additional method which we call semi-implicit mean field variational inference (SIMFVI) that can be viewed as a hybrid between SIVI and MFVI. SIMFVI uses the same variational family as SIVI, but the variational parameters are updated differently. Wherever possible, we use the analytical updates as in Equation (3) with noisy estimates of the expected values. For the additive model, this algorithm updates the variational hyper-parameters with the gradient-based method as in SIVI, while \( \varvec{\Sigma }_{j} \) and \(\varvec{\nu } \) are updated similar to the analytic MFVI (block) updates. However, we need to use a stochastic approximation of the expectation with respect to \(\varvec{\gamma }_{j}\). Hence, the updates for the scale parameter of the error variance are

$$\begin{aligned} \nu _{b_{\sigma ^2}}&= b_{\sigma ^2} + \frac{1}{2} \mathbb {E}_{\varvec{\gamma } \sim q_{\varvec{\psi }_{\varvec{\gamma }}}}\big [ (\varvec{y} - \varvec{Z}\varvec{\gamma })'(\varvec{y} - \varvec{Z}\varvec{\gamma }) \big ] \\&\approx b_{\sigma ^2} + \frac{1}{2} \bigg ( \varvec{y}'\varvec{y} - \frac{2}{S} \sum _{s=1}^{S} \varvec{\gamma }_{.,s}'\varvec{Z}'\varvec{y} + \frac{1}{S} \sum _{s=1}^{S} \varvec{\gamma }_{.,s}'\varvec{Z}'\varvec{Z}\varvec{\gamma }_{.,s} \bigg ). \end{aligned}$$

The updates for the scale parameter of the smoothing parameter for component j are,

$$\begin{aligned} \nu _{b_j}&= b_j + \frac{1}{2} \mathbb {E}_{\varvec{\gamma } \sim q_{\varvec{\psi }_{\varvec{\gamma }}}}\big [ \varvec{\gamma }_j'\varvec{K}_j \varvec{\gamma }_j\big ] \approx b_j + \frac{1}{2S} \sum _{s=1}^{S}{\varvec{\gamma }_{j,s}'\varvec{K}_j \varvec{\gamma }_{j,s}}. \end{aligned}$$
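Both updates are straightforward to implement. The following numpy sketch replaces the expectations by averages over S draws from the current variational distribution; the residual form used for the error variance is algebraically equal to the expanded form displayed above.

```python
import numpy as np

def nu_b_sigma2(y, Z, gamma_draws, b_sigma2=0.1):
    # E[(y - Z gamma)'(y - Z gamma)] estimated by the average over S draws
    resid_sq = [(y - Z @ g) @ (y - Z @ g) for g in gamma_draws]
    return b_sigma2 + 0.5 * np.mean(resid_sq)

def nu_b_tau2(K_j, gamma_j_draws, b_j=0.1):
    # E[gamma_j' K_j gamma_j] estimated by the average over S draws
    return b_j + 0.5 * np.mean([g @ K_j @ g for g in gamma_j_draws])
```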

We provide a full implementation of the described approaches in the Python programming language. The implementation uses the Python libraries numpy (Harris et al. 2020) and pytorch (Paszke et al. 2019) for random number generation and automatic differentiation.

4 Simulation

In the simulation study, we assess the uncertainty estimates of the introduced VI methods based on empirical coverage percentages over repeated simulations. The coverage percentage is the relative frequency with which the credible interval (CI) contains the true value. For a 95% CI, we therefore expect a coverage percentage close to the nominal level of 95%. In Bayesian inference, the CI is based either on the highest density interval (Turkkan and Pham-Gia 1993) or on the quantiles of the distribution. In this paper, the CIs are based on the quantiles.

In the case of MFVI (block), we suppose the coverage percentage will be below the nominal level due to the strong assumption of independence between coefficient blocks. We also presume that MFVI (full) accurately captures parameter uncertainty, but it comes with the limitation of determining a fully unstructured covariance matrix for all parameters simultaneously, which requires handling large matrices. On the other hand, SIVI and SIMFVI combine the advantage of using a blocked structure with a hierarchical construction to restore parameter dependencies. Therefore, we hypothesize that the coverage percentages of SIVI and SIMFVI are close to the nominal level as well. Additionally, we compare the aforementioned methods with the Gibbs sampler, which we use as a reference.

We assess the coverage of both point-wise and simultaneous CIs. For estimating simultaneous CIs, we develop an efficient algorithm (see Algorithm 2 in Appendix 8) based on a fully Bayesian quantile-based approach (Krivobokova et al. 2010).
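For orientation, the following numpy sketch shows a common quantile-based construction of simultaneous bands from function draws, scaling the pointwise standard deviations by a quantile of the maximal standardized deviation; it illustrates the idea but may differ in details from Algorithm 2 in the Appendix.

```python
import numpy as np

def simultaneous_band(f_draws, alpha=0.05):
    # f_draws: (S, G) array of S function draws evaluated at G grid points
    m, s = f_draws.mean(axis=0), f_draws.std(axis=0)
    max_dev = np.abs((f_draws - m) / s).max(axis=1)  # per-draw sup statistic
    c = np.quantile(max_dev, 1 - alpha)              # simultaneous multiplier
    return m - c * s, m + c * s

lower, upper = simultaneous_band(np.random.randn(1000, 50))
```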

The data generating process (DGP) is based on two covariates that affect the response in a nonlinear way. Hence, for the model, we can use P-splines and can block the covariance matrix accordingly. The DGP has the form

  1. (DGP)

    \( y_i = f_1(x_{i1}) + f_2(x_{i2}) + \epsilon _i \).

The errors are independently generated from a Gaussian distribution with variance 0.5, i.e. \(\epsilon _i\sim \mathcal {N}(0,0.5)\). The values for the covariates \(x_{i1}\) and \(x_{i2}\) are generated in two steps. In the first step, we generate values from a bivariate normal distribution, \({(z_{i1}, z_{i2})' = \varvec{z}_i \sim \mathcal {N}(\varvec{0}, \varvec{A} )}\), with \( \varvec{A} \) having ones on the diagonal and the value \({\rho \in (-1,1)}\) on the off-diagonals to control the correlation between the variables. We consider correlations of varying intensity, namely no correlation (\(\rho = 0\)), medium correlation (\(\rho = 0.45\)), and strong correlation (\(\rho = 0.9\)). In the second step, we apply a probability integral transform to the variables, such that \(x_{i1} = 5\cdot F(z_{i1}) \) and \(x_{i2} = 7\cdot F(z_{i2}) - 1\), with \( F(\cdot ) \) the univariate standard normal cumulative distribution function.

The two functional forms of the nonlinear effects are,

$$\begin{aligned} f_1(x_{i1})&= \sin \left( \frac{\pi }{4} \, x_{i1} - 1 \right) + 2 \exp \left( -(x_{i1} - 1)^2\right) , \end{aligned}$$
(12)
$$\begin{aligned} f_2(x_{i2})&= \sin \left( \frac{3\,\pi }{16} \, x_{i2} - \frac{1}{2} \right) + 2 \exp \left( -\frac{3}{2}(x_{i2} - \frac{1}{2})^2 \right) . \end{aligned}$$
(13)

The functional forms \( f_1 \) and \( f_2 \) have a similar shape, which additionally increases the difficulty of distinguishing between the two effects. Both functions have one sharp peak, around 1.2 and 0.6, respectively (see Fig. 7 in Appendix Sect. 9.2).
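For reproducibility, a numpy sketch of this DGP, combining the correlated covariates, the probability integral transform, and Equations (12) and (13):

```python
import numpy as np
from scipy.stats import norm

def simulate_dgp(n, rho, seed=0):
    rng = np.random.default_rng(seed)
    A = np.array([[1.0, rho], [rho, 1.0]])           # correlation matrix
    z = rng.multivariate_normal(np.zeros(2), A, size=n)
    x1, x2 = 5 * norm.cdf(z[:, 0]), 7 * norm.cdf(z[:, 1]) - 1
    f1 = np.sin(np.pi / 4 * x1 - 1) + 2 * np.exp(-(x1 - 1) ** 2)
    f2 = np.sin(3 * np.pi / 16 * x2 - 0.5) + 2 * np.exp(-1.5 * (x2 - 0.5) ** 2)
    y = f1 + f2 + rng.normal(0.0, np.sqrt(0.5), size=n)  # error variance 0.5
    return y, x1, x2

y, x1, x2 = simulate_dgp(n=250, rho=0.9)
```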

Moreover, we vary the sample size in the simulation study. For each of the three correlation scenarios, we run the simulation using 50, 250, or 500 observations, resulting in nine different scenarios in total. For each scenario, we use 1000 replications.

We argue that the simulation results for P-splines are transferable to other effect types involving clustered or spatial effects, as the coefficient structure in the model remains the same; only the basis functions and the number of coefficients change. Hence, it appears sufficient to limit the extent of the simulation study to two nonlinear effects modeled with P-splines.

Simulation results, depicted in Fig. 1, show the coverage percentage for each spline and method for the three scenarios with the most pronounced differences across the methods. These are, not surprisingly, the scenarios with high or medium correlation and rather small sample sizes (the results of the other scenarios are shown in Fig. 8 in Appendix Sect. 9.2).

Fig. 1

Coverage percentages among different methods for three selected scenarios. The blue dots represent local and the yellow triangles simultaneous CI coverages

In the scenarios with a high correlation between the covariates (middle and right plot), MFVI (block) has very low coverage for both splines. The simultaneous CI of the estimated function for \(f_2\) in the scenario with 50 observations has a coverage below 70%, significantly below the nominal level of 95%. SIVI, SIMFVI, and also MFVI (full) show coverages close to the nominal level. For the scenario with medium correlation and 50 observations, however, the coverage percentage of the simultaneous CI for \(f_2\) is well below the nominal level for MFVI (block) as well as MFVI (full). This might be due to the fact that, for small sample sizes, the Gaussian distribution assumption on the coefficients is too restrictive. The more flexible methods SIVI and SIMFVI improve the coverage but are also slightly below the nominal level.

For other criteria, such as the mean squared error (MSE) for each spline and the overall MSE of the fitted values, no significant differences across the methods are visible. There is only a slight tendency for the MSEs of SIVI and SIMFVI to be larger on average. For instance, in the scenario with 50 observations and strong correlation, the smallest overall MSE is that of MFVI (block) with 0.116 and the largest that of SIVI with 0.123 (see Table 2 in Appendix Sect. 9.2 for further details on this scenario).

The simulations show very accurate results for the Gibbs sampler across all criteria and scenarios. However, the coverage percentages of the local and, in particular, the simultaneous CIs were above the nominal level by about 0.4 to 4.2 percentage points in all scenarios, indicating that the uncertainty is slightly overestimated. In particular, the simultaneous CI bands of the estimated function for \(f_1\) appear to be too wide. Nevertheless, we use the Gibbs sampler as the reference when comparing the methods in the application, as the MCMC approach is expected to give asymptotically exact results.

Assuming the samples of the MCMC approach come from the desired posterior distribution, we also evaluate the KL divergence for each VI method based on these samples. This gives a more holistic evaluation over the complete distribution. We approximately evaluate

$$\begin{aligned} -{\text {KL}}\left( p(\varvec{\theta |y}) || q(\varvec{\theta }) \right) \approx \frac{1}{S}\sum _{s=1}^S \log \, q(\varvec{\theta }_s) - \frac{1}{S}\sum _{s=1}^S \log \, p(\varvec{\theta }_s|\varvec{y}) \end{aligned}$$

with Gibbs samples \(\varvec{\theta }_1,\dots , \varvec{\theta }_S \sim p(\varvec{\theta |y})\). Since we compare the different VI methods based on the same samples, we only need to evaluate the term \(\frac{1}{S}\sum _{s=1}^S \log \, q(\varvec{\theta }_s)\), that is, the average logarithmized density given the Gibbs samples (ALDG). Higher values indicate better approximations to the true posterior. For SIVI and SIMFVI, we evaluate the density of the coefficients by averaging out the mixing distribution,

$$\begin{aligned} \log \, q(\varvec{\theta }_s) \approx \sum _{j=1}^{p} \left[ \log \frac{1}{\text {M}}\sum _{m=1}^{\text {M}} q_{\varvec{\mu }_{jm}, \varvec{\Sigma }_j}(\varvec{\gamma }_{js} | \varvec{\mu }_{jm}) + \log \, q_{a_j, b_j}(\tau ^2_{js})\right] + \log \, q_{a, b}(\sigma ^2_s), \end{aligned}$$

with M samples from the neural network. For MFVI, the full factorization applies.
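As an illustration for the mean-field case, the following scipy sketch computes the ALDG by averaging the log variational density over the Gibbs samples; the container q holding the fitted variational parameters is an assumed convention.

```python
import numpy as np
from scipy.stats import invgamma, multivariate_normal

def aldg_mfvi(gamma_samples, tau2_samples, sigma2_samples, q):
    # log q(gamma) under the fitted Gaussian with mean q["mu"], cov q["Sigma"]
    log_q = multivariate_normal.logpdf(gamma_samples, q["mu"], q["Sigma"])
    for j, (a, b) in enumerate(q["tau2"]):           # IG factors for tau_j^2
        log_q += invgamma.logpdf(tau2_samples[:, j], a, scale=b)
    a, b = q["sigma2"]                               # IG factor for sigma^2
    log_q += invgamma.logpdf(sigma2_samples, a, scale=b)
    return log_q.mean()                              # higher = better fit
```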

The results confirm our previous findings. In Fig. 2, we show the ALDG of the coefficients for each VI method (figures of the total ALDG and of the ALDG for the different model parameters in all scenarios are given in Figs. 9, 10, 11, 12, 13, 14, 15, 16 and 17 in Appendix Sect. 9). MFVI (block) does not accurately capture the complete distribution, whereas SIVI and SIMFVI show significant improvements. However, evaluating the ALDG for high correlation between the covariates reveals that MFVI (full) performs slightly better. Hence, the hierarchical approach with a flexible mixing distribution restores parameter dependencies to a large extent but may fail to restore all of them.

Fig. 2

The coefficient ALDG across all VI methods for three selected scenarios: 50 observations and medium correlation (left), 50 observations and high correlation (middle) and 250 observations and high correlation (right). The boxplots are based on 1000 simulations

The extent to which SIVI and SIMFVI restore parameter dependencies is highly sensitive to the specification of the neural network. Important considerations are the network structure and the activation function. We see a deterioration in performance if the network has more than 3 hidden layers and if the activation function is the sigmoid instead of ReLU or tanh. The specification of the input dimension, however, does not affect the results significantly (one example of the sensitivity is shown in Fig. 18 in Appendix Sect. 9.2). Most important, however, is the choice of the learning rates for the parameter updates. In the simulation study, it appears that higher learning rates (up to 0.1) generally improve the results, but the algorithm becomes less numerically stable. We give further details on the model setup and the choice of hyper-parameters in Appendix Sect. 7.7.

5 Application to tree height models of Douglas fir

Douglas fir is a conifer species non-native to Germany. It is expected to be resilient to drought events and higher temperatures and, thus, with changing climatic conditions, may serve as an important addition to the tree species portfolio of German climate-smart forestry. Modeling tree heights of Douglas fir is of high value for economic and climate considerations concerning, e.g. returns on investment and carbon storage potentials.

We use data from the national forestry inventory (NFI) of Germany and a climate data set provided by the Nordwestdeutsche Forstliche Versuchsanstalt.

We model heights of Douglas fir in Germany using two types of covariates, namely tree- and climate-specific covariates. Some of these covariates are strongly correlated. Accordingly, a method should be used that can reflect the increased uncertainty of the estimates. We therefore use the novel SIVI and SIMFVI methods and compare the results to standard MFVI. Additionally, we use the Gibbs sampler as the benchmark method.

5.1 Statistical additive model

For modeling the mean tree height of Douglas fir, we employ an additive model (similar to Pya and Schmidt 2016). With the combined tree and climate data, we fit the following model for the observed tree height,

$$\begin{aligned} h_{i} \,=\;&\beta _0 + \beta _1 dbh_{i} + \beta _2 dbh^2_{i} + f_1(age_{i}) \nonumber \\&+ f_2(prec_{i})+ f_3(t_{i}) + f_4(alt_{i}) + f_{\text {geo}}(long_{i}, lat_{i}) + \epsilon _{i}, \end{aligned}$$
(14)

where we assume a Gaussian distributed error \(\epsilon _i\).

The tree-specific data are the tree height in meters (h), which we use as the response variable, the diameter at breast height (DBH) in decimeters (dbh), and the age (age).

For all climate-specific variables, i.e. the accumulated precipitation per tree over its lifespan (prec) and the accumulated temperature per tree over its lifespan (t), and also for the adjusted altitude per location (alt) we use nonlinear effects as well.

Finally, to account for the spatial effect, we include a tensor product spline with the approximated coordinates of each tract, i.e. the longitude and latitude (longlat). We provide more details about the model setup in Appendix Sect. 7.7.

We further split the data into 70% training and 30% test data to evaluate the predictive performance of each method. In total, we have 7,082 observations in the training and 3,035 observations in the test data set, and overall 826 coefficients in the model.

5.2 Results

The results, as shown in Fig. 3, reveal the importance of considering climate variables for understanding the tree heights of Douglas fir. Both increasing accumulated precipitation and increasing accumulated temperature raise the expected tree height at moderate values of the covariates. For more extreme values of accumulated temperature, the effect on tree height is less clear. For high values of accumulated precipitation, the estimated effect starts to decrease.

The estimated effect of age shows an expected functional form, basically following the typical height growth pattern over age, i.e. large height increment in younger ages and levelling off in older ages. The altitude of the tree location does not seem to play an important role.

When comparing the different proposed methods, both similarities and distinct differences are visible. The estimates for the mean effects are similar across all methods; only SIVI deviates in some parameters. For higher values of altitude, the estimated mean effect of SIVI tends to be slightly below that of the other methods. Additionally, the mean of the error variance for SIVI, at about 7.97, is estimated higher than for all other methods, whose estimates lie in the range between 7.59 and 7.61 (see Appendix Sect. 9.3, Table 4 for more information).

The differences between the methods become more apparent when considering the CIs, in particular for the estimated effects of accumulated precipitation, accumulated temperature, and altitude. Here, we face the additional problem that the covariates are correlated. The results show a pattern for correlated effects similar to that discovered in the simulation study: the 95% simultaneous CI bands of MFVI (block) are narrower than those of the other methods. However, in contrast to the simulation, SIVI is also unable to match the CI widths of the Gibbs sampler. SIMFVI and MFVI (full) show results much closer to those of the Gibbs sampler. In regions with only a few observations, SIMFVI tends to have narrower CI bands compared to Gibbs and MFVI (full).

Fig. 3

Estimated effects with 95% simultaneous CIs colored by method

The biggest differences are in the CI bands for altitude. For MFVI (block) and SIVI, the CI bands lie at one point above and at another point below the zero line, whereas for Gibbs, MFVI (full), and SIMFVI, the CI bands cover the zero line across all values of altitude.

Similarly, the spatial effects of the different methods show differences in uncertainty levels (see Fig. 4). To highlight the differences in CI width, we visualize the CI width of each VI method as a share of the width of the Gibbs sampler. Dark blue areas mean the CI is on par with Gibbs, and yellow means the CI width is about 30% of that of Gibbs, which is the lowest measured share at any location.

The biggest differences are in south-west Germany, where MFVI (block) and also SIVI show much narrower CI widths compared to the other methods. Similarly, in northern Germany, the areas of MFVI (block) and SIVI are more lightly shaded. In areas without observations (without grey dots), there appears to be no difference between the methods.

Fig. 4

Width of the 95% simultaneous CI of the two-dimensional spline as a share of the CI width of the Gibbs sampler across different VI methods. 100% (dark blue) stands for the same width as the Gibbs sampler. Tracts with at least one Douglas fir are marked as grey dots

The accuracy of parameter uncertainty in SIMFVI can be improved further by altering some parameters of the algorithm, in particular the number of samples K drawn from the neural net to approximate the ELBO from below. Higher values tighten the lower bound further, but at the expense of computational time. We opt to draw \( K=100 \) samples from the neural net for SIVI and SIMFVI; increasing the number to \( K=300 \) brings the CI width slightly closer to that of Gibbs (see an example of the spatial effect with SIMFVI in Fig. 20 in Appendix Sect. 9.3), but at a significant computational cost.

The improvement from an increase in K is only marginal, as can be seen when evaluating the whole distribution based on the Gibbs samples (see Table 1). SIMFVI with \( K=300 \) is marginally closer to but still slightly worse than MFVI (full). In general, the ALDG confirms the improved approximation of SIMFVI over MFVI (block): there is a substantial gap between the ALDG of these two methods, whereas SIVI shows only minor improvements.

Table 1 Estimated ALDG across VI methods

Finally, we compare the predictive performance of each method. The predictive power is similar across all methods. For the MSE and the predictive coverage on the test data, we do not find significant differences between the methods. The predictive coverage is about 93 to 96% for all methods, just as expected from the nominal level (see Table 5 in Appendix Sect. 9.3 for more details).

6 Conclusion

As access to ever more data resources grows, and with it the interest in fast approximate methods, variational inference has gained considerably in popularity. In our analyses of additive models, variational inference performs well in terms of point estimates and parameter uncertainty, even when making use of the strong mean-field assumption. However, the performance may degrade, and in particular the parameter uncertainty may be underestimated, if the mean-field assumption is placed on critical parameters, such as different coefficient blocks. Blocking might nevertheless be of interest, as treating all coefficients simultaneously requires handling and estimating large matrices, in particular when combining spatial and cluster effects in a model that requires numerous coefficients.

The proposed SIVI and SIMFVI algorithms are capable of using a blocked structure on the coefficients while still giving accurate results on parameter uncertainty. In cases where a Gaussian variational distribution on the coefficients is too restrictive, SIVI and SIMFVI can even outperform MFVI with a full covariance structure, as they allow more flexibility in the coefficients' posterior distribution. Yet, the performance of SIVI seems to deteriorate when dealing with large matrices, e.g. from spatial effects, or a large number of observations. In these cases, the gradient-based approach to estimating the parameters of the covariance matrices appears to be rather inefficient. The SIMFVI algorithm solves this issue and additionally needs less computational time.

There is, however, still much room for improvement in the computational time of SIVI and SIMFVI. More efficient implementations could make both methods considerably faster.

We only investigate the use case of blocked versus fully unstructured covariance matrices for the coefficients. Future research can address more complex scenarios including complex hierarchical models and extensions to generalized additive models.