Monte Carlo co-ordinate ascent variational inference

In variational inference (VI), coordinate-ascent and gradient-based approaches are two major types of algorithms for approximating difﬁcult-to-compute probability densities. In real-world implementations of complex models, Monte Carlo methods are widely used to estimate expectations in coordinate-ascent approaches and gradients in derivative-driven ones. We discuss a Monte Carlo co-ordinate ascent VI (MC-CAVI) algorithm that makes use of Markov chain Monte Carlo (MCMC) methods in the calculation of expectations required within co-ordinate ascent VI (CAVI). We show that, under regularity conditions, an MC-CAVI recursion will get arbitrarily close to a maximiser of the evidence lower bound with any given high probability. In numerical examples, the performance of MC-CAVI algorithm is compared with that of MCMC and—as a representative of derivative-based VI methods—of Black Box VI (BBVI). We discuss and demonstrate MC-CAVI’s suitability for models with hard constraints in simulated and real examples. We compare MC-CAVI’s performance with that of MCMC in an important complex model used in nuclear magnetic resonance spectroscopy data analysis—BBVI is nearly impossible to be employed in this setting due to the hard constraints involved in the model.


Introduction
Variational inference (VI) (Jordan et al. 1999;Wainwright et al. 2008) is a powerful method to approximate intractable integrals.As an alternative strategy to Markov chain Monte Carlo (MCMC) sampling, VI is fast, relatively straightforward for monitoring convergence and typically easier to scale to large data (Blei et al. 2017) than MCMC.The key idea of VI is to approximate difficult-to-compute conditional densities of latent variables, given observations, via use of optimization.A family of distributions is assumed for the latent variables, as an approximation to the exact conditional distribution.VI aims at finding the member, amongst the selected family, that minimizes the Kullback-Leibler (KL) divergence from the conditional law of interest.
Let x and z denote, respectively, the observed data and latent variables.The goal of the inference problem is to identify the conditional density (assuming a relevant reference measure, e.g.Lebesgue) of latent variables given observations, i.e. p(z|x).Let L denote a family of densities defined over the space of latent variables-we denote members of this family as q = q(z) below.The goal of VI is to find the element of the family closest in KL divergence to the true p(z|x).Thus, the original inference problem can be rewritten as an optimization one: identify q * such that q * = argmin q∈L KL(q | p(•|x)) (1) for the KL-divergence defined as KL(q | p(•|x)) = E q [log q(z)] − E q [log p(z|x)] = E q [log q(z)]−E q [log p(z, x)]+ log p(x), with log p(x) being constant w.r.t.z.Notation E q refers to expectation taken over z ∼ q.Thus, minimizing the KL divergence is equivalent to maximising the evidence lower bound, ELBO(q), given by ELBO(q) = E q [log p (z, x)] − E q [log q(z)]. (2) Let S p ⊆ R m , m ≥ 1, denote the support of the target p(z|x), and S q ⊆ R m the support of a variational density q ∈ Lassumed to be common over all members q ∈ L. Necessarily, S p ⊆ S q , otherwise the KL-divergence will diverge to +∞.
Many VI algorithms focus on the mean-field variational family, where variational densities in L are assumed to factorise over blocks of z.That is, (3) for individual supports S q i ⊆ R m i , m i ≥ 1, 1 ≤ i ≤ b, for some b ≥ 1, and i m i = m.It is advisable that highly correlated latent variables are placed in the same block to improve the performance of the VI method.There are, in general, two types of approaches to maximise ELBO in VI: a co-ordinate ascent approach and a gradientbased one.Co-ordinate ascent VI (CAVI) (Bishop 2006) is amongst the most commonly used algorithms in this context.To obtain a local maximiser for ELBO, CAVI sequentially optimizes each factor of the mean-field variational density, while holding the others fixed.Analytical calculations on function space-involving variational derivatives-imply that, for given fixed q 1 , . . ., q i−1 , q i+1 , . . ., q b , ELBO(q) is maximised for where z −i := (z i − , z i + ) denotes vector z having removed component z i , with i − (resp.i + ) denoting the ordered indices that are smaller (resp.larger) than i; E −i is the expectation taken under z −i following its variational distribution, denoted q −i .The above suggest immediately an iterative algorithm, guaranteed to provide values for ELBO(q) that cannot decrease as the updates are carried out.The expected value E −i [log p(z i − , z i , z i + , x)] can be difficult to derive analytically.Also, CAVI typically requires traversing the entire dataset at each iteration, which can be overly computationally expensive for large datasets.Gradient-based approaches, which can potentially scale up to large data-alluding here to recent Stochastic-Gradient-type methods-can be an effective alternative for ELBO optimisation.However, such algorithms have their own challenges, e.g. in the case of reparameterization Variational Bayes (VB) analytical derivation of gradients of the log-likelihood can often be problematic, while in the case of score-function VB the requirement of the gradient of log q restricts the range of the family L we can choose from.
In real-world applications, hybrid methods combining Monte Carlo with recursive algorithms are common, e.g., Auto-Encoding Variational Bayes, Doubly-Stochastic Variational Bayes for non-conjugate inference, Stochastic Expectation-Maximization (EM) (Beaumont et al. 2002;Sisson et al. 2007;Wei and Tanner 1990).In VI, Monte Carlo is often used to estimate the expectation within CAVI or the gradient within derivative-driven methods.This is the case, e.g., for Stochastic VI (Hoffman et al. 2013) and Black-Box VI (BBVI) (Ranganath et al. 2014).
BBVI is used in this work as a representative of gradientbased VI algorithms.It allows carrying out VI over a wide range of complex models.The variational density q is typically chosen within a parametric family, so finding q * in (1) is equivalent to determining an optimal set of parameters that characterize and can be approximated by black-box Monte Carlo estimators as, e.g., The approximated gradient of ELBO can then be used within a stochastic optimization procedure to update λ at the kth iteration with where {ρ k } k≥0 is a Robbins-Monro-type step-size sequence (Robbins and Monro 1951).As we will see in later sections, BBVI is accompanied by generic variance reduction methods, as the variability of (6) for complex models can be large.
Remark 1 (Hard Constraints) Though gradient-based VI methods are some times more straightforward to apply than co-ordinate ascent ones,-e.g.combined with the use of modern approaches for automatic differentiation (Kucukelbir et al. 2017)-co-ordinate ascent methods can still be important for models with hard constraints, where gradient-based algorithms are laborious to apply.(We adopt the viewpoint here that one chooses variational densities that respect the constraints of the target, for improved accuracy.)Indeed, notice in the brief description we have given above for CAVI and BBVI, the two methodologies are structurally different, as CAVI does not necessarily require to be be built via the introduction of an exogenous variational parameter λ.Thus, in the context of a support for the target p(z|x) that involves complex constraints, a CAVI approach overcomes this issue naturally by blocking together the z i 's responsible for the constraints.In contrast, introduction of the variational parameter λ creates sometimes severe complications in the development of the derivative-driven algorithm, as normalising constants that depend on λ are extremely difficult to calculate analytically and obtain their derivatives.Thus, a main argument spanning this work-and illustrated within it-is that co-ordinate-ascent-based VI methods have a critical role to play amongst VI approaches for important classes of statistical models.

Remark 2
The discussion in Remark 1 is also relevant when VB is applied with constraints imposed on the variational parameters.E.g. the latter can involve covariance matrices, whence optimisation has to be carried out on the space of symmetric positive definite matrices.Recent attempts in the VB field to overcome this issue involves updates carried out on manifolds, see e.g.Tran et al. (2019).
The main contributions of the paper are: (i) We discuss, and then apply a Monte Carlo CAVI (MC-CAVI) algorithm in a sequence of problems of increasing complexity, and study its performance.As the name suggests, MC-CAVI uses the Monte Carlo principle for the approximation of the difficult-to-compute conditional expectations, The rest of the paper is organised as follows.Section 2 presents briefly the MC-CAVI algorithm.It also provides-in a specified setting-an analytical result illustrating nonaccumulation of Monte Carlo errors in the execution of the recursions of the algorithm.That is, with a probability arbitrarily close to 1, the variational solution provided by MC-CAVI can be as close as required to the one of CAVI, for a big enough Monte Carlo sample size, regardless of the number of algorithmic iterations.Section 3 shows two numerical examples, contrasting MC-CAVI with alternative algorithms.Section 4 presents an implementation of MC-CAVI in a real, complex, challenging posterior distribution arising in metabolomics.This is a practical application, involving hard constraints, chosen to illustrate the potential of MC-CAVI in this context.We finish with some conclusions in Sect. 5.

Description of the algorithm
We begin with a description of the basic CAVI algorithm.A double subscript will be used to identify block variational densities: q i,k (z i ) (resp.q −i,k (z −i )) will refer to the density of the ith block (resp.all blocks but the ith), after k updates have been carried out on that block density (resp.k updates have been carried out on the blocks preceding the ith, and k − 1 updates on the blocks following the ith).
Assume that the expectations E −i [log p(z, x)], {i : i ∈ I}, for an index set I ⊆ {1, . . ., b}, can be obtained analytically, over all updates of the variational density q(z); and that this is not the case for i / ∈ I. Intractable integrals can be approximated via a Monte Carlo method.(As we will see in the applications in the sequel, such a Monte Carlo device typically uses samples from an appropriate MCMC algorithm.)In particular, for i / ∈ I, one obtains N ≥ 1 samples from the current q −i (z −i ) and uses the standard Monte Carlo estimate

Applicability of MC-CAVI
We discuss here the class of problems for which MC-CAVI can be applied.It is desirable to avoid settings where the order of samples or statistics to be stored in memory increases with the iterations of the algorithm.To set-up the ideas we begin with CAVI itself.Motivated by the standard exponential class of distributions, we work as follows.
Consider the case when the target density p(z, x) ≡ f (z)-we omit reference to the data x in what follows, as x is fixed and irrelevant for our purposes (notice that f is not required to integrate to 1)-is assumed to have the structure, for s-dimensional constant vector η = (η 1 , . . ., η s ), vector function T (z) = (T 1 (z), . . ., T s (z)), with some s ≥ 1, and relevant scalar functions h > 0, A; •, • is the standard inner product in R s .Also, we are given the choice of blockvariational densities q 1 (z 1 ), . . ., q b (z b ) in (3).Following the definition of CAVI from Sect.2.1-assuming that the algorithm can be applied, i.e. all required expectations can be obtained analytically-the number of 'sufficient' statistics, say T i,k giving rise to the definition of q i,k will always be upper bounded by s.Thus, in our working scenario, CAVI will be applicable with a computational cost that is upper bounded by a constant within the class of target distributions in (8)-assuming relevant costs for calculating expectations remain bounded over the algorithmic iterations.
Moving on to MC-CAVI, following the definition of index set I in Sect.2.1, recall that a Monte Carlo approach is required when updating q i (z i ) for i / ∈ I, 1 ≤ i ≤ b.In such a scenario, controlling computational costs amounts to having a target (8) admitting the factorisations, Once ( 9) is satisfied, we do not need to store all N samples from q −i (z −i ), but simply some relevant averages keeping the cost per iteration for the algorithm bounded.We stress that the combination of characterisations in ( 8)-( 9) is very general and will typically be satisfied for most practical statistical models.

Theoretical justification of MC-CAVI
An advantageous feature of MC-CAVI versus derivativedriven VI methods is its structural similarity with Monte Carlo Expectation-Maximization (MCEM).Thus, one can build on results in the MCEM literature to prove asymptotical properties of MC-CAVI; see e.g.Chan and Ledolter (1995), Booth and Hobert (1999), Levine and Casella (2001), Fort and Moulines (2003).To avoid technicalities related with working on general spaces of probability density functions, we begin by assuming a parameterised setting for the variational densities-as in the BBVI case-with the family of variational densities being closed under CAVI or (more generally) MC-CAVI updates.
Assumption 1 (Closedness of Parameterised q(•) Under Variational Update) For the CAVI or the MC-CAVI algorithm, each q i,k (z i ) density obtained during the iterations of the algorithm, 1 ≤ i ≤ b, k ≥ 0, is of the parametric form Under Assumption 1, CAVI and MC-CAVI can be corresponded to some well-defined maps M : → , M N : → respectively, so that, given current variational parameter λ, one step of the algorithms can be expressed in terms of a new parameter λ (different for each case) obtained via the updates For an analytical study of the convergence properties of CAVI itself and relevant regularity conditions, see e.g.(Bertsekas 1999, Proposition 2.7.1 ), or numerous other resources in numerical optimisation.Expressing the MC-CAVI update-say, the (k + 1)th one-as it can be seen as a random perturbation of a CAVI step.In the rest of this section we will explore the asymptotic properties of MC-CAVI.We follow closely the approach in Chan and Ledolter (1995)-as it provides a less technical procedure, compared e.g. to Fort and Moulines (2003) or other works about MCEM-making all appropriate adjustments to fit the derivations into the setting of the MC-CAVI methodology along the way.We denote by M k , M k N , the k-fold composition of M, M N respectively, for k ≥ 0.
If M(λ) = λ for some λ ∈ , then λ is a fixed point of M().A given λ * ∈ is called an isolated local maximiser of the ELBO(q(•)) if there is a neighborhood of λ * over which λ * is the unique maximiser of the ELBO(q(•)).
Notice that the above assumption refers to the deterministic update M(•), which performs co-ordinate ascent; thus requirements (i), (ii) are fairly weak for such a recursion.The critical technical assumption required for delivering the convergence results in the rest of this section is the following one.
Assumption 4 (Uniform Convergence in Probability on Compact Sets) For any compact set C ⊆ the following holds: for any , > 0, there exists a positive integer N 0 , such that for all N ≥ N 0 we have, It is beyond the context of this paper to examine Assumption 4 in more depth.We will only stress that Assumption 4 is the sufficient structural condition that allows to extend closeness between CAVI and MC-CAVI updates in a single algorithmic step into one for arbitrary number of steps.
We continue with a definition.
We will state the main asymptotic result for MC-CAVI in Theorem 1 that follows; first we require Lemma 1.
The main result of this section is as follows.
Theorem 1 Let Assumptions 1-4 hold and λ * be an isolated local maximiser of ELBO(q(•)).Then there exists a neighbourhood, say V 1 , of λ * such that for starting values λ ∈ V 1 of MC-CAVI algorithm and for all 1 > 0, there exists a k 0 such that The proofs of Lemma 1 and Theorem 1 can be found in "Appendices A and B", respectively.

Stopping criterion and sample size
The method requires the specification of the Monte Carlo size N and a stopping rule.

Principled: but impractical-approach
As the algorithm approaches a local maximum, changes in ELBO should be getting closer to zero.To evaluate the performance of MC-CAVI, one could, in principle, attempt to monitor the evolution of ELBO during the algorithmic iterations.For current variational distribution q = (q 1 , . . ., q b ), assume that MC-CAVI is about to update q i with q i = q i,N , where the addition of the second subscript at this point emphasizes the dependence of the new value for q i on the Monte Carlo size N .Define, If the algorithm is close to a local maximum, ELBO(q, N ) should be close to zero, at least for sufficiently large N .Given such a choice of N , an MC-CAVI recursion should be terminated once ELBO(q, N ) is smaller than a user-specified tolerance threshold.Assume that the random variable ELBO(q, N ) has mean μ = μ(q, N ) and variance σ 2 = σ 2 (q, N ).Chebychev's inequality implies that, with probability greater than or equal to (1 − 1/K 2 ), ELBO(q, N ) lies within the interval (μ − K σ, μ + K σ ), for any real K > 0. Assume that one fixes a large enough K .The choice of N and of a stopping criterion should be based on the requirements: zero, implying that ELBO(q, N ) differs from zero by less than 2K σ .
Requirement (i) provides a rule for the choice of Nassuming applied over all 1 ≤ i ≤ b, for q in areas close to a maximiser,-and requirement (ii) a rule for defining a stopping criterion.Unfortunately, the above considerations-based on the proper term ELBO(q) that VI aims to maximise-involve quantities that are typically impossible to obtain analytically or via some reasonably expensive approximation.

Practical considerations
Similarly to MCEM, it is recommended that N gets increased as the algorithm becomes more stable.It is computationally inefficient to start with a large value of N when the current variational distribution can be far from the maximiser.In practice, one may monitor the convergence of the algorithm by plotting relevant statistics of the variational distribution versus the number of iterations.We can declare that convergence has been reached when such traceplots show relatively small random fluctuations (due to the Monte Carlo variability) around a fixed value.At this point, one may terminate the algorithm or continue with a larger value of N , which will further decrease the traceplot variability.In the applications we encounter in the sequel, we typically have N ≤ 100, so calculating, for instance, Effective Sample Sizes to monitor the mixing performance of the MCMC steps is not practical.

Numerical examples: simulation study
In this section we illustrate MC-CAVI with two simulated examples.First, we apply MC-CAVI and CAVI on a simple model to highlight main features and implementation strategies.Then, we contrast MC-CAVI, MCMC, BBVI in a complex scenario with hard constraints.

Simulated example 1
We generate n = 10 3 data points from N(10, 100) and fit the semi-conjugate Bayesian model Example Model 1 Let x be the data sample mean.In each iteration, the CAVI density function-see (4)-for τ is that of the Gamma distribution Gamma( n+3 2 , ζ ), with whereas for ϑ that of the normal distribution N( ) and E(τ ) denote the relevant expectations under the current CAVI distributions for ϑ and τ respectively; the former are initialized at 0-there is no need to initialise E(τ ) in this case.Convergence of CAVI can be monitored, e.g., via the sequence of values of θ := (1 + n)E(τ ) and ζ .If the change in values of these two parameters is smaller than, say, 0.01%, we declare convergence.Figure 1 shows the traceplots of θ , ζ .
Convergence is reached within 0.0017 s,1 after precisely two iterations, due to the simplicity of the model.The resulted CAVI distribution for ϑ is N(9.6, 0.1), and for τ it is Gamma(501.5,50130.3)so that E(τ ) ≈ 0.01.
Assume now that q(τ ) was intractable.Since E(τ ) is required to update the approximate distribution of ϑ, an MCMC step can be employed to sample τ 1 , . . ., τ N  (E(ϑ), E(ϑ 2 )) are initialised as in CAVI.For the first 10 iterations we set N = 10, and for the remaining ones, N = 10 3 to reduce variability.We monitor the values of E(τ ) shown in Fig. 2. The figure shows that MC-CAVI has stabilized after about 15 iterations; algorithmic time was 0.0114 s.To remove some Monte Carlo variability, the final estimator of E(τ ) is produced by averaging the last 10 values of its traceplot, which gives E(τ ) = 0.01, i.e. a value very close to the one obtained by CAVI.The estimated distribution of ϑ is N(9.6, 0.1), the same as with CAVI.
The performance of MC-CAVI depends critically on the choice N .Let A be the value of N in the burn-in period, B the number of burn-in iterations and C the value of N after burn-in.Figure 3 shows trace plots of E(τ ) under different settings of the triplet A-B-C.
As with MCEM, N should typically be set to a small number at the beginning of the iterations so that the algorithm can reach fast a region of relatively high probability.N should then be increased to reduce algorithmic variability close to the convergence region.Figure 4 shows plots of convergence time versus variance of E(τ ) (left panel) and versus N (right panel).In VI, iterations are typically terminated when the (absolute) change in the monitored estimate is less than a small threshold.In MC-CAVI the estimate fluctuates around the limiting value after convergence (Table 1).In the simulation in Fig. 4, we terminate the iterations when the difference between the estimated mean (disregarding the first half of the chain) and the true value (0.01) is less than 10 −5 .Figure 4 shows that: (i) convergence time decreases when the variance of E(τ ) decreases, as anticipated; (ii) convergence time decreases when N increases.In (ii), the decrease is most evident when N is still relatively small After N exceeds 200, convergence time remains almost fixed, as the benefit brought by decrease of variance is offset by the cost of extra samples.(This is also in agreement with the policy of N set to a small value at the initial iterations of the algorithm.)

Variance reduction for BBVI
In non-trivial applications, the variability of the initial estimator ∇ λ ELBO(q) within BBVI in (6) will typically be large, so variance reduction approaches such as Rao-Blackwellization and control variates (Ranganath et al. 2014) are also used.Rao-Blackwellization (Casella and Robert 1996) reduces variances by analytically calculating conditional expectations.In BBVI, within the factorization framework of (3), where λ = (λ 1 , . . ., λ b ), and recalling identity (5) for the gradient, a Monte Carlo estimator for the gradient with respect to λ i , i ∈ {1, . . ., b}, can be simplified as Depending on the model at hand, term c i (z i , x) can be obtained analytically or via a double Monte Carlo procedure (for estimating c i (z (n) i , x), over all 1 ≤ n ≤ N )-or a combination of thereof.In BBVI, control variates (Ross 2002) can be defined on a per-component basis and be applied to the Rao-Blackwellized noisy gradients of ELBO in (11) to provide the estimator, for the control, , where f i, j , g i, j denote the jth co-ordinate of the vectorvalued functions f i , g i respectively, given below,

Simulated example 2: model with hard constraints
In this section, we discuss the performance and challenges of MC-CAVI, MCMC, BBVI for models where the support of the posterior-thus, also the variational distributioninvolves hard constraints.
Here, we provide an example which offers a simplified version of the NMR problem discussed in Sect. 4 but allows for the implementation of BBVI, as the involved normalising constants can be easily computed.Moreover, as with other gradient-based methods, BBVI requires to tune the step-size sequence {ρ k } in (7), which might be a laborious task, in particular for increasing dimension.Although there are several proposals aimed to optimise the choice of {ρ k } (Bottou 2012; Kucukelbir et al. 2017), MC-CAVI does not face such a tuning requirement.
Notice that we have the full conditional distributions, .
(Above, and in similar expressions written in the sequel, equality is meant to be properly understood as stating that 'the density on the left is equal to the density of the distribution on the right'.)For each ψ j , 1 ≤ j ≤ n, the full conditional is, where φ(•) is the density of N(0, 1) and (•) its cdf.The Metropolis-Hastings proposal for ψ j is a uniform variate from U(0, 2).

MC-CAVI
For MC-CAVI, the logarithm of the joint distribution is given by, log p(y, ϑ, κ, ψ, θ) = const.+ n 2 log θ − )), under the constraints, To comply with the above constraints, we factorise the variational distribution as, Here, for the relevant iteration k, we have, The quantity E k,k−1 ((y j − ϑ − κ j ) 2 ) used in the second line above means that the expectation is considered under ϑ ∼ q k (ϑ) and (independently) κ j ∼ q k−1 (κ j , ψ j ).
Then, MC-CAVI develops as follows: -For j = 1, . . ., n, apply an MCMC algorithm-with invariant law q k−1 (κ j , ψ j )-consisted of a number, N , of Metropolis-within-Gibbs iterations carried out over the relevant full conditionals, As with the full conditional p(ψ j |y, θ, ϑ, κ) within the MCMC sampler, we use a uniform proposal U(0, 2) at the Metropolis-Hastings step applied for q k−1 (ψ j |κ j ).For each k, the N iterations begin from the (κ j , ψ j )-values obtained at the end of the corresponding MCMC iterations at step k − 1, with very first initial values being κ, ψ j ) = (0, 1).Use the N samples to obtain E k−1 (κ j ) and E k−1 (κ 2 j ).-Update the variational distribution for ϑ, and evaluate E k (ϑ), E k (ϑ 2 ).
-Update the variational distribution for θ , and evaluate E k (θ ).

BBVI
For BBVI we assume a variational distribution q(θ, ϑ, κ, ψ | α, γ ) that factorises as in the case of CAVI in ( 13), where to be the variational parameters.Individual marginal distributions are chosen to agree-in type-with the model priors.
In particular, we set, It is straightforward to derive the required gradients (see "Appendix C" for the analytical expressions).BBVI is applied using Rao-Blackwellization and control variates for variance reduction.The algorithm is as follows, • Step 0: Set η = 0.5; initialise α 0 = 0, γ 0 = 0 with the exception α 0 ϑ = 4.
For the learning rate, we employed the AdaGrad algorithm (Duchi et al. 2011) and set , where G k is a matrix equal to the sum of the first k iterations of the outer products of the gradient, and diag(•) maps a matrix to its diagonal version.

Results
The three algorithms have different stopping criteria.We run each for 100 s for parity.A summary of results is given in Table 2. Model fitting plots and algorithmic traceplots are shown in Fig. 5. Table 2 indicates that all three algorithms approximate the posterior mean of ϑ effectively; the estimate from MC-CAVI has smaller variability than the one of BBVI; the opposite holds for the variability in the estimates for θ .Figure 5 shows that the traceplots for BBVI are unstable, a sign that the gradient estimates have high variability.In contrast, MCMC and MC-CAVI perform rather well.Figure 6 shows the 'true' posterior density of ϑ (obtained from an expensive MCMC with 10,000 iterations-5000 burn-in) and the correspond-ing approximation obtained via MC-CAVI.In this case, the variational approximation is quite accurate at the estimation of the mean but underestimates the posterior variance (rather typically for a VI method).We mention that for BBVI we also tried to use normal laws as variational distributions-as this is mainly the standard choice in the literature-however, in this case, the performance of BBVI deteriorated even further.Resonance peaks generated by each compound must be identified in the spectrum after deconvolution.The spectral signature of a compound is given by a combination of peaks not necessarily close to each other.Such compounds can generate hundreds of resonance peaks, many of which overlap.This causes difficulty in peak identification and deconvolution.The analysis of NMR spectrum is further complicated by fluctuations in peak positions among spectra induced by uncontrollable variations in experimental conditions and the chemical properties of the biological samples, e.g. by the pH.Nevertheless, extensive information on the patterns of spectral resonance generated by human metabolites is now available in online databases.By incorporating this information into a Bayesian model, we can deconvolve resonance peaks from a spectrum and obtain explicit concentration estimates for the corresponding metabolites.Spectral resonances that cannot be deconvolved in this way may also be of scientific interest; these are modelled in Astle et al. (2012) using wavelet basis functions.More specifically, an NMR spectrum is a collection of peaks convoluted with various horizontal translations and vertical scalings, with each peak having the form of a Lorentzian curve.A number of metabolites of interest have known NMR spectrum shape, with the height of the peaks or their width in a particular experiment providing information about the abundance of each metabolite.The zero-centred, standardized Lorentzian function is defined as: where γ is the peak width at half height.An example of 1 H NMR spectrum is shown in Fig. 7.The x-axis of the spectrum measures chemical shift in parts per million (ppm) and corresponds to the resonance frequency.The y-axis measures relative resonance intensity.Each spectrum peak corresponds to magnetic nuclei resonating at a particular frequency in the biological mixture, with every metabolite having a characteristic molecular 1 H NMR 'signature'; the result is a convolution of Lorentzian peaks that appear in specific positions in 1 H NMR spectra.Each metabolite in the experiment usually gives rise to more than a 'multiplet' in the spectrumi.e.linear combination of Lorentzian functions, symmetric around a central point.Spectral signature (i.e.pattern multiplets) of many metabolites are stored in public databases.The aim of the analysis is: (i) to deconvolve resonance peak in the spectrum and assign them to a particular metabolite; (ii) estimate the abundance of the catalogued metabolites; (iii) model the component of a spectrum that cannot be assigned to known compounds.Astle et al. (2012) propose a two-component joint model for a spectrum, in which the metabolites whose peaks we wish to assign explicitly are modelled parametrically, using information from the online databases, while the unassigned spectrum is modelled using wavelets.

The model
We now describe the model of Astle et al. (2012).The available data are represented by the pair (x, y), where x is a vector of n ordered points (of the order 10 3 − 10 4 ) on the chemical shift axis-often regularly spaced-and y is the vector the corresponding resonance intensity measurements (scaled, so that they sum up to 1).The conditional law of y|x is modelled under the assumption that y i |x are independent normal variables and, Here, the φ component of the model represents signatures that we wish to assign to target metabolites.The ξ component models signatures of remaining metabolites present in the spectrum, but not explicitly modelled.We refer to this latter as residual spectrum and we highlight the fact that it is important to account for it as it can unveil important information not captured by φ(•).Function φ is constructed parametrically using results from the physical theory of NMR and information available online databases or expert knowledge, while ξ is modelled semiparametrically with wavelets generated by a mother wavelet (symlet 6) that resembles the Lorentzian curve.More analytically, where M is the number of metabolites modelled explicitly and β = (β 1 , . . ., β M ) is a parameter vector corresponding to metabolite concentrations.Function t m (•) represents a continuous template function that specifies the NMR signature of metabolite m and it is defined as, where u is an index running over all multiplets assigned to metabolite m, v is an index representing a peak in a multiplet and V m,u is the number of peaks in multiplet u of metabolite m.In addition, δ * m,u specifies the theoretical position on the chemical shift axis of the centre of mass of the uth multiplet of the mth metabolite; z m,u is a positive quantity, usually equal to the number of protons in a molecule of metabolite m that contributes to the resonance signal of multiplet u; ω m,u,v is the weight determining the relative heights of the peaks of the multiplet; c m,u,v is the translation determining the horizontal offsets of the peaks from the centre of mass of the multiplet.Both ω m,u,v and c m,u,v can be computed by empirical estimates of the so-called J -coupling constants; see Hore (2015) for more details.The z m,u 's and J -coupling constants information can be found in online databases or from expert knowledge.
Finally, overall, the model for an NMR spectrum can be re-written in matrix form as: where W ∈ R n×n 1 is the inverse wavelet transform, M is the total number of known metabolites, T is an n× M matrix with its (i, m)th entry equal to t m (x i ) and θ is a scalar precision parameter.where LN denotes a log-normal distribution and δ * m,u is the estimate for δ * m,u obtained from the online database HMDB (see Wishart et al. 2007Wishart et al. , 2008Wishart et al. , 2012Wishart et al. , 2017)).In the regions of the spectrum where both parametric (i.e.φ) and semiparametric (i.e.ξ ) components need to be fitted, the likelihood is unidentifiable.To tackle this problem, Astle et al. (2012) opt for shrinkage priors for the wavelet coefficients and include a vector of hyperparameters ψ-each component ψ j,k of which corresponds to a wavelet coefficient-to penalize the semiparametric component.To reflect prior knowledge that NMR spectra are usually restricted to the half plane above the chemical shift axis, Astle et al. (2012) introduce a vector of hyperparameters τ , each component of which, τ i , corresponds to a spectral data point, to further penalize spectral reconstructions in which some components of W −1 ϑ are less than a small negative threshold.In conclusion, Astle  (ϑ, ψ, τ, θ), p(ϑ, ψ, τ, θ)

Prior specification
where ψ introduces local shrinkage for the marginal prior of ϑ and τ is a vector of n truncation limits, which bounds W −1 ϑ from below.The truncation imposes an identifiability constraint: without it, when the signature template does not match the shape of the spectral data, the mismatch will be compensated by negative wavelet coefficients, such that an ideal overall model fit is achieved even though the signature template is erroneously assigned and the concentration of metabolites is overestimated.Finally we set c j = 0.05, d j = 10 −8 , h = −0.002,r = 10 5 , a = 10 −9 , e = 10 −6 ; see Astle et al. (2012) for more details.

Results
BATMAN is an R package for estimating metabolite concentrations from NMR spectral data using a specifically designed MCMC algorithm (Hao et al. 2012) to perform posterior inference from the Bayesian model described above.We implement a MC-CAVI version of BATMAN and compare its performance with the original MCMC algorithm.Details of the implementation of MC-CAVI are given in "Appendix D".Due to the complexity of the model and the data size, it is challenging for both algorithms to reach convergence.We run the two methods, MC-CAVI and MCMC, for approximately an equal amount of time, to analyse a full spectrum with 1530 data points and modelling parametrically 10 metabolites.We fix the number of iterations for MC-CAVI to 1000, with a burn-in of 500 iterations; we set the Monte Carlo size to N = 10 for all iterations.The execution time for this MC-CAVI algorithms is 2048 s.For the MCMC algorithm, we fix the number of iterations to 2000, with a burn-in of 1000 iterations.This MCMC algorithm has an execution time of 2098 s.In 1 H NMR analysis, β (the concentration of metabolites in the biofluid) and δ * m,u (the peak positions) are the most important parameters from a scientific point of view.Traceplots of four examples (β 3 , β 4 , β 9 and δ 4,1 ) are shown in Fig. 8.These four parameters are chosen due to the different performance of the two methods, which are closely examined in Fig. 10.For β 3 and β 9 , traceplots are still far from convergence for MCMC, while they move toward the correct direction (see Fig. 8) when using MC-CAVI.For β 4 and δ 4,1 , both parameters reach a stable regime very quickly in MC-CAVI, whereas the same parameters only make local moves when implementing MCMC.For the remaining parameters in the model, both algorithms present similar results.
Figure 9 shows the fit obtained from both the algorithms, while Table 3 reports posterior estimates for β.From Fig. 9, it is evident that the overall performance of MC-CAVI is similar as that of MCMC since in most areas, the metabolites fit (orange line) captures the shape of the original spectrum quite well.Table 3 shows that, similar to standard VI behaviour, MC-CAVI underestimates the variance of the posterior density.We examine in more detail the posterior distribution of the β coefficients for which the posterior means obtained with the two algorithms differ more than 1.0e−4.Figure 10 shows that MC-CAVI manages to capture the shapes of the peaks while MCMC does not, around ppm values of 2.14 and 3.78, which correspond to spectral regions where many peaks overlap making peak deconvolution challenging.This is probably due to the faster convergence of MC-CAVI. Figure 10 shows that for areas with no overlapping (e.g.around ppm values of 2.66 and 7.53), MC-CAVI and MCMC produce similar results.
Comparing MC-CAVI and MCMC's performance in the case of the NMR model, we can draw the following conclusions: • In NMR analysis, if many peaks overlap (see Fig. 10), MC-CAVI can provide better results than MCMC.• In high-dimensional models, where the number of parameters grows with the size of data, MC-CAVI can converge faster than MCMC.• Choice of N is important for optimising the performance of MC-CAVI.Building on results derived for other Monte Carlo methods (e.g.MCEM), it is reasonable to choose a relatively small number of Monte Carlo at the beginning when the algorithm can be far from regions of parameter space of high posterior probability, and gradually increase the number of Monte Carlo iterations, with the maximum number taken once the algorithm has reached a mode.

Discussion
As a combination of VI and MCMC, MC-CAVI provides a powerful inferential tool particularly in high dimensional settings when full posterior inference is computationally demanding and the application of optimization and of noisy-gradient-based approaches, e.g.BBVI, is hindered by the presence of hard constraints.The MCMC step of MC-CAVI is necessary to deal with parameters for which VI approximation distributions are difficult or impossible to derive, for example due to the impossibility to derive closed-form expression for the normalising constant.General Monte Carlo algorithms such as sequential Monte Carlo and Hamiltonian Monte Carlo can be incorporated within MC-CAVI.Compared with MCMC, the VI step of MC-CAVI speeds up convergence and provides reliable estimates in a shorter time.Moreover, MC-CAVI scales better in high-dimensional settings.As an optimization algorithm, MC-CAVI's convergence monitoring is easier than MCMC.Moreover, MC-CAVI offers a flexible alternative to BBVI.This latter algorithm, although very general and suitable for a large range of complex models, depends crucially on the quality of the approximation to the true target provided by the variational distribution, which in high dimensional setting (in particular with hard constraints) is very difficult to assess.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material.If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.To view a copy of this licence, visit http://creativecomm ons.org/licenses/by/4.0/.

Fig. 3 Fig. 4 A
Fig.3Traceplot of E(τ ) under different settings of A-B-C (respectively, the value of N in the burn-in period, the number of burn-in iterations and the value of N after burn-in) for Simulated Example 1

Table 2 Fig. 5
Fig. 5 Model fit (left panel), traceplots of ϑ (middle panel) and traceplots of θ (right panel) for the three algorithms: MCMC (first row), MC-CAVI (second row) and BBVI (third row)-for Example Model 2-when allowed 100 s of execution.In the plots showing model fit,

Fig.
Fig. Density plots for the true posterior of ϑ (blue line)-obtained via an expensive MCMC-and the corresponding approximate distribution provided by MC-CAVI.(Color figure online)

Fig.
Fig. Traceplots of parameter value against number of iterations after the burn-in period for 3 (upper left panel), β 4 (upper right panel), β 9 (lower left panel) and δ 4,1 (lower right panel).The y-axis corresponds to the obtained parameter values (the mean of the distribution q for

Fig. 9
Fig. 9 of MC-CAVI and MCMC in terms of spectral fit.The upper panel shows the Spectral Fit from MC-CAVI The lower panel shows the Spectral Fit from MCMC algorithm.The x-axis corresponds to chemical shift measure in ppm.The y-axis corresponds to standard density