R-VGAL: A Sequential Variational Bayes Algorithm for Generalised Linear Mixed Models

Models with random effects, such as generalised linear mixed models (GLMMs), are often used for analysing clustered data. Parameter inference with these models is difficult because of the presence of cluster-specific random effects, which must be integrated out when evaluating the likelihood function. Here, we propose a sequential variational Bayes algorithm, called Recursive Variational Gaussian Approximation for Latent variable models (R-VGAL), for estimating parameters in GLMMs. The R-VGAL algorithm operates on the data sequentially, requires only a single pass through the data, and can provide parameter updates as new data are collected without the need to reprocess the previous data. At each update, the R-VGAL algorithm requires the gradient and Hessian of a "partial" log-likelihood function evaluated at the new observation, which are generally not available in closed form for GLMMs. To circumvent this issue, we propose using an importance-sampling-based approach for estimating the gradient and Hessian via Fisher's and Louis' identities. We find that R-VGAL can be unstable when traversing the first few data points, but that this issue can be mitigated by using a variant of variational tempering in the initial steps of the algorithm. Through illustrations on both simulated and real datasets, we show that R-VGAL provides good approximations to the exact posterior distributions, that it can be made robust through tempering, and that it is computationally efficient.


Introduction
Mixed models are useful for analysing clustered data, wherein observations that come from the same cluster/group are likely to be correlated. Example datasets include records of students clustered within schools, and repeated measurements of biomarkers on patients. Mixed models account for intra-group dependencies by incorporating cluster/group-specific "random effects". Inference with these models is made challenging by the fact that the likelihood function involves integrals over the random effects that are not usually tractable, except for the few cases where the distribution of the random effects is conjugate to the distribution of the data, such as in the linear mixed model (Verbeke et al., 1997), the beta-binomial model (Crowder, 1979), and Rasch's Poisson count model (Jansen, 1994). Notably, there is no closed-form expression for the likelihood function in the case of the ubiquitous logistic mixed model. Maximum-likelihood-based approaches are often used for parameter inference in mixed models. In the case of linear mixed models, parameter inference via maximum likelihood estimation is straightforward (e.g., Wakefield, 2013). For mixed models with an intractable likelihood, integrals over random effects need to be numerically approximated, for example by using Gaussian quadrature (Naylor and Smith, 1982) or the Laplace approximation (Tierney and Kadane, 1986). The likelihood may also be indirectly maximised using an expectation-maximisation type algorithm (Dempster et al., 1977), which treats the random effects as missing, and iteratively maximises the "expected complete log-likelihood" of the data and the random effects. Quasi-likelihood approaches such as penalised quasi-likelihood (PQL, Breslow and Clayton, 1993) and marginal quasi-likelihood (MQL, Goldstein, 1991) approximate nonlinear mixed models with linear mixed models, so that well-developed estimation routines for linear mixed models can be applied; see Tuerlinckx et al. (2006) for a detailed discussion of these methods. These maximum-likelihood-based methods provide point estimates, not full posterior distributions over the parameters.
Full posterior distributions can be obtained using Markov chain Monte Carlo (MCMC, e.g., Zhao et al., 2006; Fong et al., 2010). MCMC provides exact, sample-based posterior distributions, but at a higher computational cost than maximum-likelihood-based methods. Alternatively, variational Bayes (VB) methods (e.g., Ong et al., 2018; Tan and Nott, 2018) are becoming increasingly popular for estimating parameters in complex statistical models. These methods approximate the exact posterior distribution with a member from a simple and tractable family of distributions; this family is usually chosen to balance the accuracy of the approximation against computational tractability.

Generalised linear mixed models
GLMMs are statistical models that contain both fixed effects and random effects. Typically, the fixed effects are common across groups, while the random effects are group-specific, and this is the setting we focus on.
Denote by y_ij the jth response in the ith group, for i = 1, ..., N and j = 1, ..., n_i, where n_i is the number of responses in group i. Let y ≡ (y_1^⊤, ..., y_N^⊤)^⊤ be the vector of observations, where y_i ≡ (y_i1, ..., y_in_i)^⊤ contains the responses from the ith group. The GLMMs we consider are constructed by first assigning each y_ij a distribution y_ij | β, α_i, ϕ ∼ p(·), where p(·) is a member of the exponential family with a dispersion parameter ϕ that is usually related to the variance of the datum, β are the fixed effect parameters, and α_i are the group-specific random effects, for i = 1, ..., N. The mean of the responses, μ_ij ≡ E(y_ij | β, α_i, ϕ), is then modelled through

\[ g(\mu_{ij}) = x_{ij}^\top \beta + z_{ij}^\top \alpha_i, \]

where x_ij is a vector of fixed effect covariates corresponding to the jth response in the ith group; z_ij is a vector of predictor variables corresponding to the jth response and the ith random effect; and g(·) is a link function that links the response mean μ_ij to the linear predictor x_ij^⊤ β + z_ij^⊤ α_i. We further assume that α_i ⊥⊥ α_j for i ≠ j. The random effects α_i, for i = 1, ..., N, are assumed to follow a normal distribution with mean 0 and covariance matrix Σ_α, that is, α_i ∼ N(0, Σ_α). In practice, some structure is often assumed for the random effects covariance matrix, so that it is parameterised in terms of a smaller number of parameters τ, that is, Σ_α = Σ_α(τ). Inference is then made on the parameters θ = (β^⊤, τ^⊤, ϕ)^⊤.
The main objective of Bayesian inference is to obtain the posterior distribution of the model parameters θ given the observations y and the prior distribution p(θ). Through Bayes' rule, the posterior distribution is

\[ p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta). \]

The likelihood function,

\[ p(y \mid \theta) = \prod_{i=1}^{N} \int p(y_i \mid \alpha_i, \beta, \phi)\, p(\alpha_i \mid \tau)\, d\alpha_i, \qquad (3) \]

involves integrals over the random effects α_i, i = 1, ..., N. The likelihood function can be calculated exactly for the linear mixed model with normally distributed random effects, for which

\[ y_{ij} = x_{ij}^\top \beta + z_{ij}^\top \alpha_i + \epsilon_{ij}, \]

for i = 1, ..., N and j = 1, ..., n_i, where ϵ_ij is a zero-mean, normally distributed error term with variance σ_e² that is associated with the jth response from the ith group. At the group level, this model can be written as

\[ y_i = X_i \beta + Z_i \alpha_i + \epsilon_i, \]

where X_i = (x_i1, ..., x_in_i)^⊤ and Z_i = (z_i1, ..., z_in_i)^⊤, with n_i being the number of observations in the ith group, for i = 1, ..., N. The likelihood function for this linear mixed model is

\[ p(y \mid \theta) = \prod_{i=1}^{N} N\!\left(y_i;\; X_i \beta,\; Z_i \Sigma_\alpha Z_i^\top + \sigma_e^2 I_{n_i}\right). \]

The gradient and Hessian of the log-likelihood for the linear mixed model are also available in closed form.
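As a small illustration of this closed-form case, the following R snippet (ours, not from the paper's released code; it assumes the mvtnorm package) evaluates the ith group's contribution to the log-likelihood of the linear mixed model:

```r
# Log of the ith factor in the linear mixed model likelihood:
# y_i | theta ~ N(X_i beta, Z_i Sigma_alpha Z_i^T + sigma_e^2 I).
partial_loglik_lmm <- function(y_i, X_i, Z_i, beta, Sigma_alpha, sigma_e) {
  n_i <- length(y_i)
  Sigma <- Z_i %*% Sigma_alpha %*% t(Z_i) + sigma_e^2 * diag(n_i)
  mvtnorm::dmvnorm(y_i, mean = drop(X_i %*% beta), sigma = Sigma, log = TRUE)
}
```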
However, for general random effects models, the integrals over α_i in (3) cannot be computed exactly.
One important case is the logistic mixed model, given by

\[ y_{ij} \mid \beta, \alpha_i \sim \mathrm{Bernoulli}(\pi_{ij}), \qquad \mathrm{logit}(\pi_{ij}) = x_{ij}^\top \beta + z_{ij}^\top \alpha_i, \]

where logit(π_ij) = log(π_ij/(1 − π_ij)). The likelihood function for this model can, however, be estimated unbiasedly, as we show in Section 2.3.

The R-VGAL algorithm
VB is usually used for posterior inference in complex statistical models when inference using asymptotically exact methods such as MCMC is too costly; for a review see, for example, Blei et al. (2017). Let θ be a vector of model parameters. Here, we consider the class of VB methods where the posterior distribution p(θ | y) is approximated by a tractable density q(θ; λ) parameterised by λ. The variational parameters λ are optimised by minimising the Kullback-Leibler (KL) divergence between the variational distribution and the posterior distribution, that is, by minimising

\[ \mathrm{KL}\left(q(\theta; \lambda)\,\|\,p(\theta \mid y)\right) = \int q(\theta; \lambda) \log \frac{q(\theta; \lambda)}{p(\theta \mid y)}\, d\theta. \]

Many VB algorithms require processing the data as a batch; see, for example, Ong et al. (2018) and Tan and Nott (2018). The variational parameters λ are typically updated iteratively using stochastic gradient descent (Hoffman et al., 2013; Kingma and Welling, 2013). In settings with large amounts of data or continuously arriving data, it is often more practical to use online or sequential variational Bayes algorithms that update the approximation to the posterior distribution sequentially as new observations become available. These online/sequential algorithms are designed to handle data that are too large to fit in memory or that arrive in a continuous stream.
In a sequential VB framework, such as that proposed by Broderick et al. (2013), the observations y_1, ..., y_N are incorporated one by one, so that at iteration i, i = 1, ..., N, one targets an approximation q_i(θ) ≡ q(θ; λ_i) that is closest in a KL sense to p(θ | y_{1:i}), the posterior distribution of the parameters given data up to the ith observation. In this framework, q_{i−1}(θ) is treated as the "prior" for iteration i, and the KL divergence between q_i(θ) and the "pseudo-posterior" p(y_i | θ) q_{i−1}(θ)/Z_i, where Z_i = ∫ p(y_i | θ) q_{i−1}(θ) dθ, is minimised at each iteration. To approximate the posterior p(θ | y_{1:i}), Broderick et al. (2013) use a mean-field VB approach (e.g., Ormerod and Wand, 2010), which assumes no posterior dependence between the elements of θ. The R-VGA algorithm proposed by Lambert et al. (2022) follows closely that of Broderick et al. (2013), but uses a variational distribution of the form q_i(θ) = N(μ_i, P_i), where P_i is a full covariance matrix, and seeks closed-form updates for λ_i ≡ {μ_i, P_i} that minimise the KL divergence between q_i(θ) and p(y_i | θ) q_{i−1}(θ)/Z_i. Another sequential VB algorithm that is similar to that of Broderick et al. (2013) is Updating Variational Bayes (UVB, Tomasetti et al., 2022), which uses stochastic gradient descent (SGD, Bottou, 2010) at every iteration i = 1, ..., N to minimise the KL divergence between q_i(θ) and p(y_i | θ) q_{i−1}(θ)/Z_i. One advantage of R-VGA over UVB is that it does not require running a full optimisation algorithm at each iteration, since updates for λ_i are available in closed form.
The R-VGA algorithm requires the gradient ∇_θ log p(y_i | θ) and Hessian ∇²_θ log p(y_i | θ) of the "partial" log-likelihood for the ith observation. However, for the GLMMs discussed in Section 2.1, there are usually no closed-form expressions for these quantities. Our R-VGAL algorithm circumvents this issue by replacing the gradient and Hessian with unbiased estimates. These unbiased estimates are obtained by using an importance-sampling-based approach applied to Fisher's and Louis' identities (Cappé et al., 2005), which we discuss in more detail in Section 2.3. We summarise the R-VGAL algorithm in Algorithm 1.
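For concreteness, the following R sketch shows the form of a single R-VGAL update, assuming the R-VGA recursion of Lambert et al. (2022) with the expectations under q_{i−1}(θ) estimated by Monte Carlo; the function names (rvgal_update, grad_loglik_est, hess_loglik_est) are ours and not from the authors' code.

```r
# One R-VGAL update: q_{i-1}(theta) = N(mu, P) -> q_i(theta) = N(mu_new, P_new).
# grad_loglik_est(theta, y_i) and hess_loglik_est(theta, y_i) are assumed to
# return unbiased estimates of the gradient and Hessian of log p(y_i | theta);
# they are placeholders for the estimators described in Section 2.3.
rvgal_update <- function(mu, P, y_i, grad_loglik_est, hess_loglik_est, S = 100) {
  d <- length(mu)
  L <- t(chol(P))                                  # P = L L^T
  # Draw S Monte Carlo samples theta^(s) ~ q_{i-1}(theta) = N(mu, P)
  thetas <- lapply(1:S, function(s) drop(mu + L %*% rnorm(d)))
  grad_bar <- Reduce(`+`, lapply(thetas, function(th) grad_loglik_est(th, y_i))) / S
  hess_bar <- Reduce(`+`, lapply(thetas, function(th) hess_loglik_est(th, y_i))) / S
  prec_new <- solve(P) - hess_bar                  # P_i^{-1} = P_{i-1}^{-1} - E[Hessian]
  P_new <- solve(prec_new)
  mu_new <- drop(mu + P_new %*% grad_bar)          # mu_i = mu_{i-1} + P_i E[gradient]
  list(mu = mu_new, P = P_new)
}
```

Note that the precision update precedes the mean update, so the freshly updated P_i is used when moving μ.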

Approximations of the likelihood gradient and Hessian
This section discusses approaches to obtain unbiased estimates of the gradient and the Hessian of the log-likelihood with respect to the parameters.

Approximation of the gradient with Fisher's identity
Consider the ith iteration. Fisher's identity (Cappé et al., 2005) for the gradient of log p(y_i | θ) is

\[ \nabla_\theta \log p(y_i \mid \theta) = \int \nabla_\theta \log p(y_i, \alpha_i \mid \theta)\, p(\alpha_i \mid y_i, \theta)\, d\alpha_i. \qquad (9) \]

If it is possible to sample directly from p(α_i | y_i, θ) (e.g., as it is with the linear random effects model in Section 3.1), the above identity can be approximated by

\[ \nabla_\theta \log p(y_i \mid \theta) \approx \frac{1}{S_\alpha} \sum_{s=1}^{S_\alpha} \nabla_\theta \log p(y_i, \alpha_i^{(s)} \mid \theta), \qquad \alpha_i^{(s)} \sim p(\alpha_i \mid y_i, \theta). \]

In the case where direct sampling from p(α_i | y_i, θ) is difficult, we use importance sampling (e.g., Gelman et al., 2004) to estimate the gradient of the log-likelihood in (9). Specifically, we draw samples {α_i^{(s)} : s = 1, ..., S_α} from an importance distribution r(α_i | y_i, θ), and then compute the weights

\[ w_i^{(s)} = \frac{p(y_i \mid \alpha_i^{(s)}, \theta)\, p(\alpha_i^{(s)} \mid \theta)}{r(\alpha_i^{(s)} \mid y_i, \theta)}, \qquad s = 1, \ldots, S_\alpha. \]

The gradient of the log-likelihood is then approximated as

\[ \nabla_\theta \log p(y_i \mid \theta) \approx \sum_{s=1}^{S_\alpha} \tilde{w}_i^{(s)}\, \nabla_\theta \log p(y_i, \alpha_i^{(s)} \mid \theta), \]

where W_i ≡ {w̃_i^{(s)} : s = 1, ..., S_α} are the normalised weights, given by w̃_i^{(s)} = w_i^{(s)} / Σ_{k=1}^{S_α} w_i^{(k)}. One possible choice for the importance distribution is the distribution of the random effects, that is, r(α_i | y_i, θ) = p(α_i | θ).
In this case, the weights reduce to w_i^{(s)} = p(y_i | α_i^{(s)}, θ), since p(α_i^{(s)} | θ) cancels with the importance density. We use this importance distribution in all of the illustrations in Section 3.
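As an illustration of this estimator with r(α_i | y_i, θ) = p(α_i | θ), the following R sketch (ours; a standalone variant of the grad_loglik_est placeholder used earlier, with the design matrices passed explicitly) estimates the gradient via Fisher's identity for the random-intercept logistic model of Section 3.2, where θ = (β^⊤, ϕ_τ)^⊤ and ϕ_τ = log(τ²):

```r
# Importance-sampling estimate of grad log p(y_i | theta) via Fisher's identity,
# with importance distribution r(alpha_i | y_i, theta) = p(alpha_i | theta).
grad_loglik_est <- function(theta, y_i, X_i, z_i, S_alpha = 100) {
  p <- ncol(X_i)
  beta <- theta[1:p]; tau <- exp(theta[p + 1] / 2)
  alphas <- rnorm(S_alpha, 0, tau)              # draws from p(alpha_i | theta)
  # log weights: with r = p(alpha_i | theta), w^(s) = p(y_i | alpha^(s), theta)
  logw <- sapply(alphas, function(a) {
    eta <- drop(X_i %*% beta) + z_i * a
    sum(y_i * eta - log1p(exp(eta)))            # Bernoulli log-likelihood
  })
  w <- exp(logw - max(logw)); w <- w / sum(w)   # normalised weights
  # weighted average of grad log p(y_i, alpha_i | theta)
  grads <- sapply(seq_along(alphas), function(s) {
    a <- alphas[s]
    pi_ij <- plogis(drop(X_i %*% beta) + z_i * a)
    g_beta <- colSums((y_i - pi_ij) * X_i)      # d/d beta
    g_phi  <- -0.5 + a^2 / (2 * tau^2)          # d/d phi_tau of log N(a; 0, tau^2)
    c(g_beta, g_phi)
  })
  drop(grads %*% w)
}
```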

Approximation of the Hessian with Louis' identity
Consider again the ith iteration. Louis' identity (Cappé et al., 2005) for the Hessian of log p(y_i | θ) is

\[ \nabla^2_\theta \log p(y_i \mid \theta) = -\nabla_\theta \log p(y_i \mid \theta)\, \nabla_\theta \log p(y_i \mid \theta)^\top + \frac{\nabla^2_\theta\, p(y_i \mid \theta)}{p(y_i \mid \theta)}, \qquad (12) \]

where

\[ \frac{\nabla^2_\theta\, p(y_i \mid \theta)}{p(y_i \mid \theta)} = \int \nabla^2_\theta \log p(y_i, \alpha_i \mid \theta)\, p(\alpha_i \mid y_i, \theta)\, d\alpha_i + \int \nabla_\theta \log p(y_i, \alpha_i \mid \theta)\, \nabla_\theta \log p(y_i, \alpha_i \mid \theta)^\top\, p(\alpha_i \mid y_i, \theta)\, d\alpha_i. \qquad (13) \]

The first term on the right-hand side of (12) is obtained using Fisher's identity, as discussed in Section 2.3.1. The second term consists of two integrals (see (13)), which can also be approximated using samples: if direct sampling from p(α_i | y_i, θ) is not possible, importance sampling (as in Section 2.3.1) can be used instead. In this case, for computational efficiency, the same samples {α_i^{(s)} : s = 1, ..., S_α} that were used to approximate the score using Fisher's identity, along with their corresponding normalised weights W_i, can also be used to obtain an estimate of the second term in Louis' identity. Then

\[ \frac{\nabla^2_\theta\, p(y_i \mid \theta)}{p(y_i \mid \theta)} \approx \sum_{s=1}^{S_\alpha} \tilde{w}_i^{(s)} \left[ \nabla^2_\theta \log p(y_i, \alpha_i^{(s)} \mid \theta) + \nabla_\theta \log p(y_i, \alpha_i^{(s)} \mid \theta)\, \nabla_\theta \log p(y_i, \alpha_i^{(s)} \mid \theta)^\top \right]. \]
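A matching R sketch of the Louis-identity estimate, reusing the samples and normalised weights from the Fisher-identity step (ours; grad_joint and hess_joint are hypothetical functions returning ∇_θ log p(y_i, α_i | θ) and ∇²_θ log p(y_i, α_i | θ) at a given α_i):

```r
# Louis'-identity estimate of Hess log p(y_i | theta), reusing the weighted
# samples (alphas, w) from the Fisher-identity step.
hess_loglik_est <- function(alphas, w, grad_joint, hess_joint) {
  d <- length(grad_joint(alphas[1]))
  term2 <- matrix(0, d, d)                      # estimate of (13)
  grad_bar <- numeric(d)
  for (s in seq_along(alphas)) {
    g <- grad_joint(alphas[s])
    term2 <- term2 + w[s] * (hess_joint(alphas[s]) + tcrossprod(g))
    grad_bar <- grad_bar + w[s] * g             # Fisher-identity gradient estimate
  }
  # Louis' identity (12): Hess = -grad grad^T + second term
  -tcrossprod(grad_bar) + term2
}
```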

Variational tempering
A possible problem with R-VGAL is its instability during the first few iterations. Figures S13 and S14 in Section S3 of the online supplement show that the first few observations can heavily influence the trajectory of the variational mean, making the R-VGAL parameter estimates sensitive to the ordering of the observations.
Here, we propose a tempering approach to stabilise the R-VGAL algorithm during the initial few steps.
In standard batch variational tempering (e.g., Mandt et al., 2016; Huang et al., 2018), tempering is used to smooth out the objective function by "down-weighting" the likelihood p(y | θ) with a tempering sequence {a_m : m = 1, ..., M}, thereby preventing the optimisation from getting stuck in local optima. Tempering is done by optimising the KL divergence between the variational distribution q(θ; λ) and a sequence of "tempered" distributions given by η_m(θ) = p(y | θ)^{a_m} p(θ)/Z_m, where Z_m = ∫ p(y | θ)^{a_m} p(θ) dθ, for m = 1, ..., M; the original objective function is recovered when a_m = 1.

We propose a version of R-VGAL with tempering, as shown in Algorithm 2. In this version, the R-VGAL updates of the mean and precision matrix for each observation are split into K steps. In each step, we multiply the gradient and the Hessian of log p(y_i | θ) by a factor a = 1/K (which acts as a "step size"), and then update the variational parameters K times during the ith iteration. Using a smaller step size helps stabilise the R-VGAL algorithm, particularly during the first few observations. Section S3 of the online supplement shows that adding tempering for the first few observations makes the R-VGAL algorithm more robust to different orderings of the data.
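A minimal R sketch of the resulting damped update for a single observation, under the same assumptions as the rvgal_update sketch above (ours, not the authors' implementation):

```r
# Tempered R-VGAL update for one observation: the estimated gradient and
# Hessian are damped by a = 1/K and the update is applied K times, with
# fresh samples of theta drawn from the current q at each step.
rvgal_update_tempered <- function(mu, P, y_i, grad_est, hess_est, K = 4, S = 100) {
  a <- 1 / K
  for (k in 1:K) {
    d <- length(mu); L <- t(chol(P))
    thetas <- lapply(1:S, function(s) drop(mu + L %*% rnorm(d)))
    grad_bar <- Reduce(`+`, lapply(thetas, function(th) grad_est(th, y_i))) / S
    hess_bar <- Reduce(`+`, lapply(thetas, function(th) hess_est(th, y_i))) / S
    P <- solve(solve(P) - a * hess_bar)    # damped precision update
    mu <- drop(mu + P %*% (a * grad_bar))  # damped mean update
  }
  list(mu = mu, P = P)
}
```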

Algorithm 2 R-VGAL with tempering
Input: observations y_1, ..., y_N; initial values μ_0 and P_0; number of observations to temper n_temp; number of tempering steps K.
Output: variational parameters μ_i and P_i, for i = 1, ..., N.
Set q_0(θ) = N(μ_0, P_0)
for i = 1, ..., N do
    if i ≤ n_temp, update (μ, P) in K damped steps, multiplying the estimated gradient and Hessian of log p(y_i | θ) by a = 1/K in each step
    otherwise, perform a single (undamped) R-VGAL update
end for


Applications of R-VGAL

In this section, we apply R-VGAL to estimate parameters in linear and logistic mixed models using two simulated datasets and two real datasets: the Six City dataset from Fitzmaurice and Laird (1993), and the POLYPHARMACY dataset from Hosmer et al. (2013). We validate R-VGAL against Hamiltonian Monte Carlo (HMC, Neal, 2011; Betancourt and Girolami, 2015), which we implement using the Stan programming language (Stan Development Team, 2023) in R (R Core Team, 2022). For all examples, we run HMC for 15,000 iterations and discard the first 5,000 as burn-in. Reproducible code for all examples is available at https://github.com/bao-anh-vu/R-VGAL.

Linear mixed effect model
In this example, we generate data from a linear mixed model with N = 200 groups and n = 10 responses per group. The jth response from the ith group is modelled as

\[ y_{ij} = x_{ij}^\top \beta + z_{ij}\, \alpha_i + \epsilon_{ij}, \qquad \alpha_i \sim N(0, \sigma_\alpha^2), \qquad \epsilon_{ij} \sim N(0, \sigma_\epsilon^2), \]

for i = 1, ..., N and j = 1, ..., n, where x_ij is drawn from a N(0, I_4) distribution and z_ij is drawn from a N(0, 1) distribution. The true parameter values are β = (−1.5, 1.5, 0.5, 0.25)^⊤, σ_α = 0.9, and σ_ϵ = 0.7.
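For reference, data with this structure can be generated as follows (our R sketch; the seed is arbitrary):

```r
# Simulating from the linear mixed model of this example.
set.seed(1)
N <- 200; n <- 10
beta <- c(-1.5, 1.5, 0.5, 0.25); sigma_alpha <- 0.9; sigma_eps <- 0.7
y <- matrix(NA, N, n)
for (i in 1:N) {
  alpha_i <- rnorm(1, 0, sigma_alpha)   # group-specific random effect
  X_i <- matrix(rnorm(n * 4), n, 4)     # x_ij ~ N(0, I_4)
  z_i <- rnorm(n)                       # z_ij ~ N(0, 1)
  y[i, ] <- drop(X_i %*% beta) + z_i * alpha_i + rnorm(n, 0, sigma_eps)
}
```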
At the group level, the linear mixed model is

\[ y_i = X_i \beta + Z_i \alpha_i + \epsilon_i, \]

where X_i = (x_i1, ..., x_in)^⊤, Z_i = (z_i1, ..., z_in)^⊤, and ε_i = (ϵ_i1, ..., ϵ_in)^⊤. At each iteration i = 1, ..., N, the R-VGAL algorithm makes use of the "partial" likelihood of the observations from the ith group, p(y_i | θ) = N(μ_{y|θ}, Σ_{y|θ}), where μ_{y|θ} = X_i β and Σ_{y|θ} = σ_α² Z_i Z_i^⊤ + σ_ϵ² I_n. For this model, the gradient and Hessian of log p(y_i | θ) with respect to each of the parameters are available in closed form; see Section S1.1 of the online supplement. In this case, we are therefore able to compare the accuracy of R-VGAL implemented using approximate gradients and Hessians with that of R-VGAL implemented using exact gradients and Hessians.
The prior distribution we use, which is also the "initial" variational distribution, places independent normal priors on the elements of θ = (β^⊤, ϕ_α, ϕ_ϵ)^⊤, where ϕ_α = log(σ_α²) and ϕ_ϵ = log(σ_ϵ²); in particular, ϕ_α and ϕ_ϵ are each given a N(log(0.5²), 1) prior. A N(log(0.5²), 1) prior distribution for ϕ_α and ϕ_ϵ is equivalent to a log-normal prior distribution with mean 0.41 and variance 0.29 for both σ_α² and σ_ϵ². Under this prior distribution, the 2.5th and 97.5th percentiles for both σ_α² and σ_ϵ² are (0.035, 1.775).
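These prior summaries are easy to verify numerically; the following R snippet (ours) reproduces them:

```r
# Implied log-normal prior on sigma^2 when phi = log(sigma^2) ~ N(log(0.5^2), 1):
mu_phi <- log(0.5^2); sd_phi <- 1
exp(mu_phi + sd_phi^2 / 2)                         # mean: ~0.41
(exp(sd_phi^2) - 1) * exp(2 * mu_phi + sd_phi^2)   # variance: ~0.29
exp(qnorm(c(0.025, 0.975), mu_phi, sd_phi))        # percentiles: ~(0.035, 1.775)
```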
We run R-VGAL with variational tempering as described in Section 2.4, with n_temp = 10 tempered observations and K = 4 tempering steps per observation. At each iteration i = 1, ..., 200, when approximating the gradient and Hessian of log p(y_i | θ) using Fisher's and Louis' identities, we use S_α = 100 Monte Carlo samples (of α_i). When approximating the expectations with respect to q_{i−1}(θ) in the R-VGAL updates of the mean and precision matrix, we use S = 100 Monte Carlo samples (of θ). This value of S was chosen based on an experimental study of the effect of S and S_α on the R-VGAL posterior estimates, reported in Section S2 of the online supplement.
We validate R-VGAL against HMC (Neal, 2011; Betancourt and Girolami, 2015), which we implemented in Stan. Figure 1 shows the marginal posterior distributions of the parameters, along with bivariate posterior distributions, as estimated using R-VGAL with approximate gradients and Hessians, R-VGAL with exact gradients and Hessians, and HMC. The posterior distributions obtained using R-VGAL are clearly very similar to those obtained using HMC, irrespective of whether exact or approximate gradients and Hessians are used.

Logistic mixed effect model
In this example, we generate simulated data from a random effect logistic regression model with N = 500 groups and n = 10 responses per group. The model is

\[ y_{ij} \mid \beta, \alpha_i \sim \mathrm{Bernoulli}(\pi_{ij}), \qquad \mathrm{logit}(\pi_{ij}) = x_{ij}^\top \beta + z_{ij}\, \alpha_i, \qquad \alpha_i \sim N(0, \tau^2), \]

where x_ij is drawn from a N(0, I_4) distribution, and z_ij is drawn from a N(0, 1) distribution, for i = 1, ..., N and j = 1, ..., n. The true parameter values are β = (−1.5, 1.5, 0.5, 0.25)^⊤ and τ = 0.9.
As in the linear case, although the parameters of the model are β and τ, we work with θ = (β^⊤, ϕ_τ)^⊤, where ϕ_τ = log(τ²). The gradient and Hessian of the "partial" log-likelihood log p(y_i | θ) for this model are not analytically tractable, but they can be estimated unbiasedly using Fisher's and Louis' identities, as discussed in Section 2.3. These identities require expressions for ∇_θ log p(y_i, α_i | θ) and ∇²_θ log p(y_i, α_i | θ), which are provided in Section S1.2 of the online supplement.
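For reference, the joint log-density whose gradient and Hessian these identities require is, under this model (a standard computation consistent with the supplement),

\[ \log p(y_i, \alpha_i \mid \theta) = \sum_{j=1}^{n}\left[ y_{ij}\,\eta_{ij} - \log\!\left(1 + e^{\eta_{ij}}\right)\right] - \frac{1}{2}\log(2\pi\tau^2) - \frac{\alpha_i^2}{2\tau^2}, \qquad \eta_{ij} = x_{ij}^\top \beta + z_{ij}\, \alpha_i. \]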
The prior distribution we use, which is also the "initial" variational distribution, places independent normal priors on β and on ϕ_τ.

Figure 1: Exact posterior distributions (from HMC, in blue) and approximate posterior distributions (from R-VGAL with estimated gradients and Hessians, in red, and from R-VGAL with exact gradients and Hessians, in yellow) for the linear mixed model experiment. Diagonal panels: marginal posterior distributions, with true parameters denoted using dotted lines. Off-diagonal panels: bivariate posterior distributions, with true parameters denoted using the symbol ×.
We run R-VGAL with variational tempering as described in Section 2.4, with n_temp = 10 tempered observations and K = 4 tempering steps per observation. At each iteration i = 1, ..., 500, the gradient and Hessian of log p(y_i | θ) are estimated using S_α = 100 Monte Carlo samples (of α_i), while the expectations with respect to q_{i−1}(θ) in the R-VGAL updates of the mean and precision matrix are estimated with S = 100 Monte Carlo samples (of θ).
Figure 2 shows the marginal posterior distributions of the parameters, along with bivariate posterior distributions, as estimated using R-VGAL and HMC. The posterior distributions obtained using R-VGAL are again very similar to those obtained using HMC.

Real data examples
We now apply R-VGAL to two real datasets: the Six City dataset from Fitzmaurice and Laird (1993), and the POLYPHARMACY dataset from Hosmer et al. (2013).
For the Six City dataset, we follow Tran et al. (2017) and consider the random intercept logistic regression model

\[ \mathrm{logit}(\pi_{ij}) = \beta_1 + \beta_2\, \mathrm{Age}_{ij} + \beta_3\, \mathrm{Smoke}_{ij} + \alpha_i, \qquad \alpha_i \sim N(0, \tau^2), \]

where π_ij = p(y_ij = 1 | β, τ²), with β = (β_1, β_2, β_3)^⊤, for i = 1, ..., 537 and j = 1, ..., 4. The binary response variable y_ij = 1 if child i is wheezing at time point j, and 0 otherwise. The covariate Age_ij is the age of child i at time point j, centred at 9 years, while the covariate Smoke_ij = 1 if the mother of child i is smoking at time point j, and 0 otherwise. Finally, α_i is the random effect associated with the ith child. The parameters of the model are θ = (β^⊤, ϕ_τ)^⊤, where ϕ_τ = log(τ²).
For the POLYPHARMACY dataset, we consider the random intercept logistic regression model from Tan and Nott (2018):

\[ \mathrm{logit}(\pi_{ij}) = \beta_0 + \beta_{\mathrm{gender}}\, \mathrm{Gender}_i + \beta_{\mathrm{race}}\, \mathrm{Race}_i + \beta_{\mathrm{age}}\, \mathrm{Age}_{ij} + \beta_{\mathrm{MHV1}}\, \mathrm{MHV1}_{ij} + \beta_{\mathrm{MHV2}}\, \mathrm{MHV2}_{ij} + \beta_{\mathrm{MHV3}}\, \mathrm{MHV3}_{ij} + \beta_{\mathrm{INPTMHV}}\, \mathrm{INPTMHV}_{ij} + \alpha_i, \]

where π_ij = p(y_ij = 1 | β, τ²) and α_i ∼ N(0, τ²), for i = 1, ..., 500 and j = 1, ..., 7. The response variable y_ij is 1 if subject i in year j is taking drugs from three or more different classes (of drugs), and 0 otherwise. The covariate Gender_i = 1 if subject i is male, and 0 if female, while Race_i = 0 if the race of subject i is white, and 1 otherwise. The covariate Age_ij is the age (in years and months, to two decimal places) of subject i in year j. The number of outpatient mental health visits (MHV) for subject i in year j is encoded through three dummy variables, of which MHV1_ij = 1 if 1 ≤ MHV_ij ≤ 5, and 0 otherwise (MHV2_ij and MHV3_ij are defined analogously for higher visit counts); INPTMHV_ij = 0 if there were no inpatient mental health visits for subject i in year j, and 1 otherwise. Finally, α_i is a subject-level random effect for subject i. The parameters of the model are θ = (β^⊤, ϕ_τ)^⊤, where ϕ_τ = log(τ²).

We use similar priors/initial variational distributions for both examples, placing independent normal priors on the elements of β and on ϕ_τ, with a N(1, 1) prior for ϕ_τ. A N(1, 1) prior distribution for ϕ_τ is equivalent to a log-normal prior distribution with mean 4.48 and variance 34.51 for τ². Under this prior distribution, the 2.5th and 97.5th percentiles for τ² are (0.383, 19.297), which cover most values of τ² encountered in practice.
On both datasets, we apply the R-VGAL with tempering algorithm, with n_temp = 10 tempered observations and K = 4 tempering steps per observation. At each R-VGAL iteration, the gradient and Hessian of log p(y_i | θ) are approximated using S_α = 200 Monte Carlo samples (of α_i), and the expectations with respect to q_{i−1}(θ) in the R-VGAL updates of the mean and precision matrix are approximated using Monte Carlo samples (of θ) as in the simulated examples.

Figure 3 shows the marginal posterior distributions of the parameters, along with bivariate posterior distributions, estimated using R-VGAL and HMC for the Six City dataset. Figure 4 shows the same plots for the POLYPHARMACY dataset. In the Six City example, there is a slight difference between the R-VGAL and HMC marginal and bivariate posterior densities for the fixed effect β_smoke, but the posterior densities for the other parameters are very similar between the two methods. For the POLYPHARMACY example, there are slight differences between the R-VGAL and HMC marginal and bivariate posterior densities for the intercept β_0 and the fixed effects β_gender and β_race, but for the other parameters, the posterior densities are comparable between the two methods.
Table 1 compares the computing time of R-VGAL and HMC for the simulated and real data examples discussed in this section. The last column shows the average time taken for a single iteration of R-VGAL.
For the linear example, where we run R-VGAL with both the theoretical and the estimated gradients/Hessians, the displayed time is that of R-VGAL with the estimated gradients/Hessians. All experiments were carried out on a high-end desktop computer with 64 GB of RAM, an Intel® Core™ i9-7900X CPU, and an NVIDIA® 1080Ti graphics processing unit (GPU). The table shows that the R-VGAL algorithm is generally faster than HMC; in the example where the number of parameters to estimate is highest, R-VGAL is an order of magnitude faster than HMC, which is substantial given that our code is not as highly optimised as that in Stan. Clearly, we can expect R-VGAL to be increasingly useful as the dimensionality of the parameter space increases. Furthermore, since R-VGAL is a sequential algorithm, posterior approximations from R-VGAL can be easily updated as new observations become available: to incorporate an additional observation, R-VGAL need only perform a single update, while an algorithm like HMC must be rerun from scratch.
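In terms of the earlier rvgal_update sketch, processing a newly arrived group y_new amounts to a single call (hypothetical names as before):

```r
# A new group of observations y_new arrives: one R-VGAL step updates the
# posterior approximation, with no need to revisit y_1, ..., y_N.
fit <- rvgal_update(fit$mu, fit$P, y_new,
                    grad_loglik_est, hess_loglik_est)
```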

Conclusion
In this article, we propose a sequential variational Bayes algorithm for estimating parameters in generalised linear mixed models (GLMMs), based on an extension of the R-VGA algorithm of Lambert et al. (2022). The original R-VGA algorithm requires the gradient and Hessian of the partial log-likelihood at each observation, which are generally not available in closed form for GLMMs. To overcome this, we use Fisher's and Louis' identities to obtain unbiased estimates of the gradient and Hessian, which can be used in place of the closed-form gradient and Hessian in the R-VGAL algorithm.
We apply R-VGAL to the linear mixed effect and logistic mixed effect models with simulated and real datasets. In all examples, we compare the posterior distributions of the parameters estimated using R-VGAL to those obtained using HMC (Neal, 2011; Betancourt and Girolami, 2015). The examples show that R-VGAL yields posterior estimates comparable to those of HMC while being substantially faster. R-VGAL would be especially useful in situations where new observations are being continuously collected.
In the current paper, we assume that the random effects are independent and identically distributed. Future work will attempt to extend R-VGAL to cases where the random effects are correlated; this will expand the set of models on which R-VGAL can be used to include time series and state space models.
S1 Appendix A: Gradient and Hessian derivations

S1.1 Derivation of gradient and Hessian expressions for the linear mixed model

In the linear mixed model, the gradient and Hessian of the log-likelihood at observation i, for i = 1, ..., N, are available in closed form. This section gives the expressions for the theoretical gradient and Hessian, as well as for their approximations using Fisher's and Louis' identities.

S1.1.1 Theoretical gradient and Hessian
The "partial" log-likelihood for observations from the ith group is given by where The gradient of the log-likelihood with respect to the parameters θ is given by where the components are, respectively, with The Hessian is given by The diagonal terms in the Hessian are the second derivatives where with The expression for ∇ 2 ϕϵ log p(y i | θ) is the same as in (S8), but with all derivatives with respect to ϕ α replaced by those with respect to ϕ ϵ .Note that The off-diagonal terms in the Hessian are where S1.1.2Expressions for the gradient and Hessian using Fisher's and Louis' identities Fisher's identity (9) requires the gradient ∇ θ log p(y i , α i | θ), while Louis' identity (12) requires the Hessian ∇ 2 θ log p(y i , α i | θ).We now give the expression for these quantities for the linear mixed model considered in Section 3.1.
S1.1.2 Expressions for the gradient and Hessian using Fisher's and Louis' identities

Fisher's identity (9) requires the gradient ∇_θ log p(y_i, α_i | θ), while Louis' identity (12) requires the Hessian ∇²_θ log p(y_i, α_i | θ). We now give expressions for these quantities for the linear mixed model considered in Section 3.1.

For i = 1, ..., N, the joint log-density can be written as log p(y_i, α_i | θ) = log p(y_i | α_i, θ) + log p(α_i | θ). Thus, the gradient ∇_θ log p(y_i, α_i | θ) has components

\[ \nabla_\beta \log p(y_i, \alpha_i \mid \theta) = e^{-\phi_\epsilon}\, X_i^\top (y_i - X_i\beta - Z_i\alpha_i), \]
\[ \nabla_{\phi_\alpha} \log p(y_i, \alpha_i \mid \theta) = -\frac{1}{2} + \frac{\alpha_i^2}{2 e^{\phi_\alpha}}, \qquad \nabla_{\phi_\epsilon} \log p(y_i, \alpha_i \mid \theta) = -\frac{n}{2} + \frac{e^{-\phi_\epsilon}}{2}\,\lVert y_i - X_i\beta - Z_i\alpha_i\rVert^2. \]

Similarly, the approximation of the Hessian using Louis' identity requires ∇²_θ log p(y_i, α_i | θ), whose blocks follow by differentiating the expressions above; for example, ∇²_β log p(y_i, α_i | θ) = −e^{−ϕ_ϵ} X_i^⊤ X_i.

S2 Effect of the Monte Carlo sample sizes S and S_α

S2.1 Simulated logistic dataset

We study the effect of the Monte Carlo sample sizes S and S_α on the R-VGAL posterior estimates using the simulated logistic dataset of Section 3.2, running R-VGAL independently 10 times for each pair of values. The results (see, e.g., Figures S3 and S4) show that increasing the values of S and S_α reduces the variability of the R-VGAL posterior estimates across multiple independent runs.

S2.2 POLYPHARMACY dataset
The POLYPHARMACY dataset used in this section is the same as that used in Section 3.3 of the main paper. As in Section S2.1, we set S = S_α, with values taken from the set {50, 100, 500, 1000}. We independently run R-VGAL 10 times on the POLYPHARMACY dataset for each pair of S and S_α values, and plot the posterior densities from all 10 runs for each parameter. For comparison, the HMC posterior distributions (from a single run) are also plotted. As with the previous example, the results show that increasing the values of S and S_α reduces the variability of the R-VGAL posterior estimates across multiple runs. This phenomenon is particularly pronounced for the random effect standard deviation τ. Suitable values for S and S_α are likely to be application-dependent; however, from our studies, we conclude that S and S_α need to be at least 100 for the Monte Carlo sample sizes not to have a substantial effect on the final estimates.

S3 Robustness check of the R-VGAL algorithm
In this section, we use the POLYPHARMACY dataset of Section 3.3 to check the robustness of the R-VGAL algorithm under different orderings of the data. The simulations in this section show that R-VGAL can be unstable while traversing the first few observations, which makes it sensitive to the ordering of the observations. This instability can, however, be alleviated with variational tempering, as described in Section 2.4 of the main paper.
Figures S9 and S10 show the R-VGAL posterior density estimates from 10 independent runs using the original ordering of the data and a random ordering of the data, respectively. In both simulations, the number of Monte Carlo samples S used to estimate the expectations with respect to q_{i−1}(θ) and the number of samples S_α used to estimate the gradients/Hessians are fixed to 100. In both figures, the HMC posterior densities for each parameter are plotted in black for comparison. The figures show that the R-VGAL estimates are quite far from the HMC estimates when using the original ordering of the data, while they are reasonably close to the HMC estimates when using the random ordering of the data. This suggests that the R-VGAL estimates are not robust with respect to the ordering of the data.
To confirm that the source of variability in the R-VGAL estimates is the different data orderings, and not the low number of Monte Carlo samples, we increase the Monte Carlo sample sizes to S = S_α = 1000; the corresponding results are shown in Figures S11 and S12. Figure S13 shows that, under the original ordering and without tempering, the trajectories of the variational means are highly variable over the first few observations, which leads to the biased posterior mean of τ. In Figure S14, where the data were randomly reordered, the trajectories of the parameters are less variable initially, which then allows the variational mean to converge towards the true values more rapidly. This shows that the R-VGAL algorithm is unstable while traversing the first few observations, making the algorithm sensitive to the ordering of the data.
We propose a tempering approach (in Section 2.4 of the main paper) to make the R-VGAL algorithm more robust. By tempering the first few observations, the R-VGAL posterior estimates become much more consistent across different data orderings. Figures S15 and S16 show the posterior density estimates from 10 independent runs of the R-VGAL algorithm using the original ordering of the data and a random ordering of the data, respectively, with tempering done on the first 10 observations. These figures show that the posterior density estimates of R-VGAL with tempering are consistent across the two different orderings of the data, and also consistent with those obtained from HMC.
Figures S17 and S18 display the R-VGAL posterior densities using the original ordering and a random reordering of the data, respectively, with tempering on the first 10 observations and Monte Carlo sample sizes S = S_α = 1000.

Figure 2: Exact posterior distributions from HMC (in blue) and approximate posterior distributions from R-VGAL with estimated gradients and Hessians (in red) for the logistic mixed model experiment. Diagonal panels: marginal posterior distributions, with true parameters denoted using dotted lines. Off-diagonal panels: bivariate posterior distributions, with true parameters denoted using the symbol ×.

Figure 3: Exact posterior distributions from HMC (in blue) and approximate posterior distributions from R-VGAL with estimated gradients and Hessians (in red) for the experiment with the Six City dataset. Diagonal panels: marginal posterior distributions. Off-diagonal panels: bivariate posterior distributions.

Figure 4: Exact posterior distributions from HMC (in blue) and approximate posterior distributions from R-VGAL with estimated gradients and Hessians (in red) for the experiment with the POLYPHARMACY dataset. Diagonal panels: marginal posterior distributions. Off-diagonal panels: bivariate posterior distributions.

Figure S3: R-VGAL posterior distributions from 10 runs on simulated logistic data are plotted in colour. The Monte Carlo sample sizes are S = 500 and S_α = 500. HMC posterior distributions are plotted in black for comparison.

Figure S4: R-VGAL posterior distributions from 10 runs on simulated logistic data are plotted in colour. The Monte Carlo sample sizes are S = 1000 and S_α = 1000. HMC posterior distributions are plotted in black for comparison.

Figure S5: R-VGAL posterior distributions from 10 runs on the POLYPHARMACY dataset are plotted in colour. Tempering is done on the first 10 observations, and the Monte Carlo sample sizes are S = 50 and S_α = 50. HMC posterior distributions are plotted in black for comparison.

Figure S6: R-VGAL posterior distributions from 10 runs on the POLYPHARMACY dataset are plotted in colour. Tempering is done on the first 10 observations, and the Monte Carlo sample sizes are S = 100 and S_α = 100. HMC posterior distributions are plotted in black for comparison.

Figure S7: R-VGAL posterior distributions from 10 runs on the POLYPHARMACY dataset are plotted in colour. Tempering is done on the first 10 observations, and the Monte Carlo sample sizes are S = 500 and S_α = 500. HMC posterior distributions are plotted in black for comparison.

Figure S8: R-VGAL posterior distributions from 10 runs on the POLYPHARMACY dataset are plotted in colour. Tempering is done on the first 10 observations, and the Monte Carlo sample sizes are S = 1000 and S_α = 1000. HMC posterior distributions are plotted in black for comparison.

Figures S11 and S12 show the R-VGAL posterior density estimates from 10 independent runs using the original ordering of the data and a random ordering of the data, respectively, with the Monte Carlo sample sizes set to S = S_α = 1000. The posterior densities for each parameter differ between the two orderings; for instance, with the original ordering, the posterior of τ is centred around 4.5, while with the random ordering, the posterior of τ is centred around 2.4.

Figure S10: R-VGAL posterior distributions from 10 independent runs using the random ordering of the data are plotted in different colours. The Monte Carlo sample sizes are S = 100 and S_α = 100. HMC posterior distributions are plotted in black for comparison.

Figure S11: R-VGAL posterior distributions from 10 independent runs using the original ordering of the data are plotted in colour. The Monte Carlo sample sizes are S = 1000 and S_α = 1000. HMC posterior distributions are plotted in black for comparison.

Figure S12: R-VGAL posterior distributions from 10 independent runs using the random ordering of the data are plotted in colour. The Monte Carlo sample sizes are S = 1000 and S_α = 1000. HMC posterior distributions are plotted in black for comparison.

Figure S13: Trajectories of the variational mean without tempering (in blue) and with tempering (in red) for each parameter across 10 independent runs of R-VGAL on the original ordering of the data. The Monte Carlo sample sizes are S = 1000 and S_α = 1000.

Figure S14: Trajectories of the variational mean without tempering (in blue) and with tempering (in red) for each parameter across 10 independent runs of R-VGAL on the random ordering of the data. The Monte Carlo sample sizes are S = 1000 and S_α = 1000.

Figure S15: R-VGAL posterior distributions from 10 independent runs using the original ordering of the data are plotted in colour. Tempering is done on the first 10 observations. The Monte Carlo sample sizes are S = 100 and S_α = 100. HMC posterior distributions are plotted in black for comparison.

Figure S16: R-VGAL posterior distributions from 10 independent runs using a random reordering of the data are plotted in colour. Tempering is done on the first 10 observations. The Monte Carlo sample sizes are S = 100 and S_α = 100. HMC posterior distributions are plotted in black for comparison.

Figure S17: R-VGAL posterior distributions from 10 independent runs using the original ordering of the data are plotted in colour. Tempering is done on the first 10 observations. The Monte Carlo sample sizes are S = 1000 and S_α = 1000. HMC posterior distributions are plotted in black for comparison.

Figure S18: R-VGAL posterior distributions from 10 independent runs using a random reordering of the data are plotted in colour. Tempering is done on the first 10 observations. The Monte Carlo sample sizes are S = 1000 and S_α = 1000. HMC posterior distributions are plotted in black for comparison.

Table 1: The computing time (in seconds) for the R-VGAL and HMC methods on the simulated and real datasets.