1 Introduction

Gaussian Process (GP) models represent a class of models that are popular in data analysis due to the associated flexibility and interpretability. Both these features are a direct consequence of their rich parameterization. Flexibility is due to the nonparametric prior over latent variables conditioning observations, whereas interpretability is due to the parameterization of the structure associated with the latent variables. Observations are conditionally independent given a set of jointly Gaussian latent variables, and are assumed to be distributed according to the particular type of data being modeled. The covariance structure of the latent variables is then parameterized by a set of hyper-parameters that characterizes the covariance of the input vectors in terms of length-scales and intensity of interaction. GP models comprise a large set of models, and this paper focuses in particular on Logistic Regression with GP priors (LRG) (Rasmussen and Williams 2006), Log-Gaussian Cox models (LCX) (Møller et al. 1998), Stochastic Volatility models with GP priors (VLT) (Wilson and Ghahramani 2010), and Ordinal Regression with GP priors (ORD) (Chu and Ghahramani 2005).

Exact inference in GP models is analytically intractable. Most of the work to tackle such intractability focuses on deterministic approximations to integrate out latent variables; those approaches include the Laplace Approximation (LA) (Tierney and Kadane 1986), Expectation Propagation (EP) (Minka 2001), and mean field approximations (Opper and Winther 2000) (see, e.g., Rasmussen and Williams 2006 for an extensive presentation of such approximations, and Kuss and Rasmussen 2005 for their assessment on LRG models). Deterministic approximations provide a computationally tractable way to integrate out latent variables, but it is not possible to quantify the error that they introduce in the quantification of uncertainty in predictions (although EP for LRG is reported to be very accurate in Kuss and Rasmussen 2005); also, those methods target the integration of latent variables only.

In the direction of providing a fully Bayesian treatment of GP models, it is necessary to integrate out latent variables as well as hyper-parameters, and this is usually done by quadrature methods (Cseke and Heskes 2011; Rue et al. 2009), thus limiting the number of hyper-parameters that can be employed in GP models.

Based on those considerations, this paper focuses on non-deterministic methods to carry out inference in GP models, and in particular on stochastic approximations based on Markov Chain Monte Carlo (MCMC) methods. The use of MCMC based inference methods is appealing as it provides asymptotic guarantees of convergence to exact inference. In practice, this translates into the possibility of achieving results with the desired level of accuracy (Flegal et al. 2007). Unfortunately, the use of MCMC methods for inference in GP models is extremely difficult. The aim of this paper is to discuss the challenges associated with MCMC based inference for GP models, and compare a number of strategies that have been proposed in the literature to tackle them. A preliminary version of this work can be found in Filippone et al. (2012b).

To the best of our knowledge, this work (i) is the first attempt to extensively assess the state-of-the-art in stochastic-based inference methods for GP models, and (ii) sets the bar for new MCMC methods for inference in GP models. Along with those contributions, this paper presents (iii) a variant of the Hybrid Monte Carlo algorithm that outperforms state-of-the-art methods to sample from the posterior distribution of the latent variables, and (iv) tests the combination of parameterizations, as recently proposed in Yu and Meng (2011), in the case of GP models.

1.1 Gaussian Process models

Let \(X = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}\) be a set of n input vectors described by a set of d covariates \(\mathbf {x}_{i} \in \mathbb {R}^{d}\), associated with observed responses \(\mathbf{y} = \{y_1, \ldots, y_n\}\). In GP models, the generative process modeling the observed data y given X is as follows. Observations are assumed to be conditionally independent given a set of n latent variables \(\mathbf{f} = \{f_1, \ldots, f_n\}\), and distributed according to a certain distribution depending on the particular type of data, e.g., Bernoulli for binary labels and Poisson for observations in the form of counts. This can be translated into a likelihood function of the form \(p(\mathbf {y}| \mathbf {f}) = \prod_{i=1}^{n} p(y_{i} | f_{i})\), where for generality the distribution \(p(y_i | f_i)\) is left unspecified.

In this work, latent variables are assumed to be drawn from a zero mean GP prior with covariance function k. The GP prior is a prior over functions, and the covariance structure given by k specifies the characteristics of such functions (i.e., degree of smoothness and marginal variance). Let k be parameterized by a vector of hyper-parameters \(\boldsymbol {\theta }= (\sigma, \psi_{\tau_{1}}, \ldots, \psi_{\tau_{d}})\), and assume:

$$ k(\mathbf {x}_i, \mathbf {x}_j | \boldsymbol {\theta }) = \sigma q(\mathbf {x}_i, \mathbf {x}_j | \boldsymbol {\psi }_{\boldsymbol {\tau }}) = \sigma\exp \Biggl[-\frac{1}{2} \sum _{r=1}^d \frac{ (\mathbf {x}_i - \mathbf {x}_j)_{(r)}^2}{\exp(\psi_{\tau_r})^2} \Biggr] $$

with \(\exp(\psi_{\tau_{r}})\) defining the length-scale of the interaction between the input vectors for the rth covariate and σ giving the marginal variance for latent variables. This type of covariance can be used for Automatic Relevance Determination (ARD) (Mackay 1994) of the covariates, as the values \(\tau_{i} = \exp(\psi_{\tau_{i}})\) can be interpreted as length-scale parameters. This definition of covariance function is adopted in many applications and is the one we will consider in the remainder of this paper. Exponentiation of the hyper-parameters is convenient, as standard MCMC transition operators can then be employed for \(\psi_{\tau_{i}}\), thus avoiding dealing with boundary conditions or non-standard MCMC proposals (Robert and Casella 2005). Let Q be the matrix whose entries are \(q_{ij} = q(\mathbf{x}_i, \mathbf{x}_j | \boldsymbol{\psi}_{\boldsymbol{\tau}})\); the covariance matrix K will then be \(K = \sigma Q\). The model is fully specified by choosing a prior p(θ) for the hyper-parameters. The model structure is therefore hierarchical, with hyper-parameters conditioning the latent variables that, in turn, condition observations, so that \(p(\mathbf{y}, \mathbf{f}, \boldsymbol{\theta}) = p(\mathbf{y}|\mathbf{f}) p(\mathbf{f}|\boldsymbol{\theta}) p(\boldsymbol{\theta})\).
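As an illustration of the covariance defined above, the following is a minimal NumPy sketch that builds Q and K = σQ from the log length-scales. The function name `ard_covariance` and the toy inputs are assumptions made for this example and do not come from the paper.

```python
import numpy as np

def ard_covariance(X, sigma, psi_tau):
    """Return (K, Q) with K = sigma * Q and
    q_ij = exp(-0.5 * sum_r (x_ir - x_jr)^2 / exp(psi_tau_r)^2)."""
    tau = np.exp(psi_tau)                       # one length-scale per covariate
    Z = X / tau                                 # rescale each covariate by its length-scale
    sq = np.sum(Z**2, axis=1)
    sqdist = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    sqdist = np.maximum(sqdist, 0.0)            # guard against tiny negative round-off
    Q = np.exp(-0.5 * sqdist)
    return sigma * Q, Q

# Toy usage: 5 inputs in the unit square, hypothetical hyper-parameter values.
rng = np.random.default_rng(0)
X = rng.uniform(size=(5, 2))
psi_tau = np.array([-1.0, -2.0])
K, Q = ard_covariance(X, sigma=np.exp(2.0), psi_tau=psi_tau)
```

Working with `psi_tau` rather than the length-scales themselves keeps the sampled quantities unconstrained, which is the convenience referred to in the text.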

In a Bayesian setting, the predictive distribution for new input values \(\mathbf{x}_*\) can be written in the following way (for the sake of clarity we drop the explicit conditioning on X and \(\mathbf{x}_*\)):

$$ p(y_* | \mathbf {y}) = \iiint p(y_* | f_*) p(f_* | \mathbf {f}, \boldsymbol {\theta }) p(\mathbf {f}, \boldsymbol {\theta }| \mathbf {y}) df_* d\mathbf {f}d\boldsymbol {\theta }$$

The left hand side of Eq. (2) is a full probability distribution characterizing the uncertainty in predicting \(y_*\) given the GP modeling assumption.

In this work we will focus on stochastic approximations for obtaining samples from the posterior distribution of f and θ, so that we can obtain a Monte Carlo estimate of the predictive distribution as follows:

$$ p(y_* | \mathbf {y}) \simeq\frac{1}{N} \sum _{i=1}^N \int p(y_* | f_*) p\bigl(f_* | \mathbf {f}^{(i)}, \boldsymbol {\theta }^{(i)}\bigr) df_* $$

where N denotes the number of samples used to compute the estimate. In Eq. (3) we denoted the ith samples from the posterior distribution of f and θ obtained by means of MCMC methods by \(\mathbf{f}^{(i)}\) and \(\boldsymbol{\theta}^{(i)}\). Note that the remaining integral is univariate and it is generally easy to evaluate.

1.2 Challenges in MCMC based inference for GP models

Sampling from the posterior of latent variables and hyper-parameters by joint proposals is not feasible; it is extremely unlikely to propose a set of latent variables and hyper-parameters that are compatible with each other and with the observed data. This forces one to consider schemes such as Gibbs sampling, where groups of variables are updated one at a time, leading to the following challenges:

(i) Due to the hierarchical structure of GP models, chains converge slowly and mix poorly if the coupling effect between the groups of variables is not dealt with properly. This requires some form of reparameterization or clever proposal mechanism that efficiently decouples the dependencies between the groups of variables. This effect has drawn a lot of attention in the case of hierarchical models in general (Yu and Meng 2011), and recently in GP models (Knorr-Held and Rue 2002; Murray and Adams 2010). In Knorr-Held and Rue (2002) a joint update of latent variables and hyper-parameters is proposed with the aim of avoiding proposals for hyper-parameters to be conditioned on the values of latent variables. In Murray and Adams (2010) a parameterization based on auxiliary data is proposed that aims at reducing the coupling between the two groups of variables. Other ideas involve the use of reparameterizations based on whitening the latent variables; in the terminology of Yu and Meng (2011), this corresponds to employing the so called Ancillary Augmentation (AA) parameterization. Recently, Yu and Meng (2011) proposed to interweave parameterizations characterized by complementary features in order to boost sampling efficiency. Parameterizations can be complementary in the sense that they offer better performance in either strong or weak data limits; the idea of combining parameterizations is to achieve high sampling efficiency in both strong and weak data scenarios. We are interested in comparing the methods in Knorr-Held and Rue (2002), Murray and Adams (2010) and Yu and Meng (2011) applied to GP models. Another possibility would be to approximately integrate out latent variables and obtain samples from the corresponding approximate posterior of hyper-parameters. 
For GP classification this might be a sensible thing to do, as the Expectation Propagation approximation has been reported to be very accurate (Kuss and Rasmussen 2005); however, this is peculiar to GP classification and for general GP models it may not be the case.

(ii) Sampling hyper-parameters and latent variables cannot be done using exact Gibbs steps, and it requires proposals that are accepted/rejected based on a Hastings ratio, leading to a waste of expensive computations. Transition operators characterized by acceptance mechanisms embedded in a Gibbs sampler are usually referred to as Metropolis-within-Gibbs operators. Designing proposals that guarantee high acceptance and independence between samples is extremely challenging, especially because latent variables can have dimensions in the order of hundreds or thousands. We will compare several transition operators, for different steps of the Gibbs sampler, with the aim of gaining insights about ways to strike a good balance between efficiency and computational cost. We will consider transition operators characterized by proposal mechanisms with increasing complexity, and in particular the Metropolis-Hastings (MH) operator which is based on random walk types of proposals, Hybrid Monte Carlo (HMC) which uses the gradient of the log-density of interest, and manifold methods (Girolami and Calderhead 2011) which use curvature information (i.e., second derivatives of the log-density).

The paper is organized as follows. Sections 2 and 3 report the parameterization strategies and the transition operators considered in this work. Sections 4 and 5 report an extensive comparison of those strategies and transition operators, on simulated and real data, on the basis of efficiency, convergence speed and computational complexity; Sect. 6 concludes the paper. For the sake of readability, most of the technical derivations can be found in the appendices.

2 Dealing with the hierarchical structure of GP models

2.1 Sufficient and ancillary augmentation

From a generative perspective, the model structure is hierarchical with latent variables representing sufficient statistics for the hyper-parameters. This parameterization is referred to as Sufficient Augmentation (SA) in Yu and Meng (2011) and allows one to express the joint density as

$$ \mathrm{SA} \quad p(\mathbf {y}, \mathbf {f}, \boldsymbol {\theta }) = p(\mathbf {y}| \mathbf {f}) p(\mathbf {f}| \boldsymbol {\theta }) p(\boldsymbol {\theta }) $$

It is also possible to introduce the decomposition of the matrix Q into the product of two factors \(LL^{\mathrm{T}}\), and view the generation of the latent variables as \(\mathbf {f}= \sqrt{\sigma} L \boldsymbol {\nu }\) with \(\boldsymbol {\nu }\sim \mathcal {N}(\boldsymbol {\nu }| \mathbf {0}, I)\), which implies that \(\mathbf {f}\sim \mathcal {N}(\mathbf {f}| \mathbf{0}, K)\). In the remainder of this paper, we will consider L to be the lower triangular Cholesky factor of Q, but in principle any square root of Q could be used. In this way, ν is ancillary for θ and it is possible to express the joint density as

$$ \mathrm{AA} \quad p(\mathbf {y}, \boldsymbol {\nu }, \boldsymbol {\theta }) = p(\mathbf {y}| \boldsymbol {\nu }, \boldsymbol {\theta }) p(\boldsymbol {\nu }) p(\boldsymbol {\theta }) $$

This parameterization is called Ancillary Augmentation (AA) in the terminology of Yu and Meng (2011). In Murray and Adams (2010) SA and AA are referred to as unwhitened and whitened parameterizations respectively. Weak and strong data limits can influence the efficiency in sampling using either parameterization. For this reason, it is important to choose an efficient parameterization for the particular problem under study and for the available amount of data, as both these aspects can dramatically influence efficiency and convergence speed of the chains.
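The mapping between the SA variables f and the AA variables ν can be sketched in a few lines of NumPy. This is only an illustration under the assumption that L is the lower Cholesky factor of Q; the helper names `to_whitened` and `from_whitened` and the small jitter added to Q are choices made for this example.

```python
import numpy as np

def from_whitened(nu, sigma, L):
    """AA -> SA: f = sqrt(sigma) * L @ nu, so that f ~ N(0, sigma * L L^T)."""
    return np.sqrt(sigma) * (L @ nu)

def to_whitened(f, sigma, L):
    """SA -> AA: nu = sigma^{-1/2} * L^{-1} f."""
    return np.linalg.solve(L, f) / np.sqrt(sigma)

# Toy one-dimensional example with a hypothetical length-scale of 0.3.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 6)
Q = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / 0.3**2) + 1e-8 * np.eye(6)
L = np.linalg.cholesky(Q)          # lower triangular factor, Q = L @ L.T
sigma = 2.0

nu = rng.standard_normal(6)        # ancillary variables, independent of theta
f = from_whitened(nu, sigma, L)    # latent variables under the GP prior
nu_back = to_whitened(f, sigma, L)
```

A general solver is used here for brevity; a triangular solve would exploit the structure of L.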

2.2 Ancillarity-Sufficiency Interweaving Strategy (ASIS)

In this section we briefly review the main results presented in Yu and Meng (2011) on the combination of parameterizations to improve convergence and efficiency of MCMC methods, and we will illustrate how these results can be applied to GP models. Intuitively, combining parameterizations seems promising to take the best from them in both weak and strong data limits, or at least, to avoid the possibility that chains do not converge because of the wrong choice of parameterization. Alternating the sampling in the SA and AA parameterizations is the most obvious way of combining the two parameterizations, but as recently investigated in Yu and Meng (2011), interweaving SA and AA is actually a more promising way forward. From a theoretical perspective, the geometric rate of convergence r of the scheme when the parameterizations are interwoven is related to the rates of the two schemes \(r_1\) and \(r_2\) by \(r \leq R_{1,2} \sqrt{r_{1} r_{2}}\), where \(R_{1,2}\) is the maximal correlation between the latent variables for the two schemes. Given that the former expression implies \(r \leq \max(r_1, r_2)\), combining the two parameterizations leads to a scheme that is no worse than the worse of the two. This is already an advantage compared to using a single scheme when one is in doubt on which scheme to use. However, the key result is the fact that \(R_{1,2}\) can be very small depending on the two parameterizations, so it is possible to make the combined scheme converge quickly even if neither of the individual schemes does. In general, this result is quite remarkable, as once different reparameterizations are available, combining them using the interweaving strategy is simple to implement, and can dramatically boost sampling efficiency. In GP models, the ASIS scheme amounts to interweaving SA and AA updates, which, following Yu and Meng (2011), yields:

$$ \mathbf {f}| \mathbf {y}, \boldsymbol {\theta }\quad\longrightarrow\quad \boldsymbol {\theta }| \mathbf {f}\quad \longrightarrow\quad \boldsymbol {\nu }= \sigma^{-1/2} L^{-1} \mathbf {f}\quad \longrightarrow\quad \boldsymbol {\theta }| \mathbf {y}, \boldsymbol {\nu }$$

2.3 Knorr-Held and Rue (KHR)

The idea underpinning KHR is to jointly sample parameters and latent variables as follows. Firstly, a set of hyper-parameters θ′|θ is proposed and secondly a set of latent variables conditioned on the new set of hyper-parameters, namely f′|y,θ′, is proposed. The proposal (θ′,f′) is then jointly accepted or rejected according to a standard Hastings ratio. The key idea is to prevent the acceptance of the proposal θ′ from depending on the current f, thus avoiding the strong coupling effect due to the hierarchical nature of the model. KHR was proposed in applications making use of Gaussian Markov Random Fields, and we will discuss the application of this idea for GP models in the section reporting the experiments. In order to avoid difficulties in devising a proposal for sampling from f′|y,θ′, here we set the proposal as the Gaussian obtained by constructing a Laplace approximation to p(f|y,θ′).

2.4 Surrogate method (SURR)

In the SURR method (Murray and Adams 2010), a set of auxiliary latent variables g is introduced as a noisy version of f; in particular, \(p(\mathbf {g}| \mathbf {f}, \boldsymbol {\theta }) = \mathcal {N}(\mathbf {g}| \mathbf {f}, S_{\boldsymbol {\theta }})\). This construction yields a conditional distribution for f of the form \(p(\mathbf {f}| \mathbf {g}, \boldsymbol {\theta }) = \mathcal {N}(\mathbf {f}| \mathbf {m}, R)\), with \(R = S_{\boldsymbol{\theta}} - S_{\boldsymbol{\theta}} (S_{\boldsymbol{\theta}} + K)^{-1} S_{\boldsymbol{\theta}}\) and \(\mathbf {m}= R S_{\boldsymbol {\theta }}^{-1} \mathbf {g}\). After decomposing \(R = DD^{\mathrm{T}}\), the sampling of θ is then conditioned on the variables η defined by \(\mathbf{f} = D\boldsymbol{\eta} + \mathbf{m}\). The covariance \(S_{\boldsymbol{\theta}}\) is constructed to be diagonal with elements obtained by matching the posterior for each latent variable individually or by Taylor approximations (see Murray and Adams 2010 for details).

3 MCMC transition operators considered in this work

This section presents the transition operators considered in this work. We are interested in understanding whether and to what extent employing proposal mechanisms making use of gradient or curvature information of the target density improves sampling efficiency and speed of convergence with respect to computational complexity. We therefore consider transition operators with increasing complexity, and in particular the Metropolis-Hastings (MH) operator which is based on random walk types of proposals, the Hybrid Monte Carlo (HMC) operator which uses gradient information, and the Simplified Manifold Metropolis Adjusted Langevin Algorithm (SMMALA) operator which is one of the simplest manifold MCMC methods proposed in Girolami and Calderhead (2011) using curvature information.

For the sake of clarity, we will focus on the transition operators for f, but the same operators can be easily applied to θ. We will first present MH, HMC, and SMMALA, and we will then discuss Elliptical Slice Sampling and a few variants of MH and HMC that have been specifically proposed for sampling f, and do not have counterparts for θ. In the case of latent variables, the operators aim to leave the posterior p(f|y,θ) invariant; in the remainder of this work, W(f) is defined as log[p(y|f)p(f|θ)], which equals the log of the desired target density up to constants. In the case of hyper-parameters we can define the invariant distribution according to the chosen parameterization and apply the operators presented here for sampling θ rather than f.

3.1 Metropolis-Hastings (MH)

The Metropolis-Hastings transition operator employs a proposal mechanism g(f′|f) based on a random walk (Robert and Casella 2005). A common choice is to use a multivariate Gaussian proposal with covariance Σ centered at the former position f, thus taking the form \(g(\mathbf {f}^{\prime}|\mathbf {f}) = \mathcal{N}(\mathbf {f}^{\prime} | \mathbf {f},\varSigma)\). For such a symmetric proposal mechanism, f′ is then accepted with probability min{1,exp(W(f′)−W(f))}.
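A random walk MH update of this kind can be sketched as follows. The function name `mh_step`, the Cholesky factor argument `chol_Sigma`, and the standard Gaussian toy target are assumptions made for illustration only.

```python
import numpy as np

def mh_step(f, W, chol_Sigma, rng):
    """One random walk MH update targeting exp(W(f)).

    chol_Sigma is a Cholesky factor of the proposal covariance Sigma.
    Returns the new state and a flag telling whether the proposal was accepted."""
    f_new = f + chol_Sigma @ rng.standard_normal(f.shape[0])
    log_ratio = W(f_new) - W(f)                 # symmetric proposal: no Hastings correction
    if np.log(rng.uniform()) < log_ratio:
        return f_new, True
    return f, False

# Toy usage: a two-dimensional standard Gaussian target.
W = lambda f: -0.5 * f @ f
rng = np.random.default_rng(0)
f = np.zeros(2)
samples = np.empty((2000, 2))
n_accepted = 0
for t in range(2000):
    f, accepted = mh_step(f, W, chol_Sigma=np.eye(2), rng=rng)
    samples[t] = f
    n_accepted += accepted
```

The comparison is carried out on the log scale, which is equivalent to accepting with probability min{1,exp(W(f′)−W(f))} and avoids overflow.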

3.2 Hybrid Monte Carlo (HMC)

In Hybrid Monte Carlo (HMC) the proposals are based on the analogy of a physical system, where a particle is simulated moving in a potential field (Neal 1993). An auxiliary variable p, that plays the role of a momentum variable, is drawn from \(\mathcal {N}(\mathbf {p}| \mathbf {0}, M)\), where the covariance matrix M is the so called mass matrix. The joint density of f and p factorizes as p(f,p)=exp(W(f))p(p), and the negative log-joint density reads

$$ H(\mathbf {f}, \mathbf {p}) = -W(\mathbf {f}) + \frac{1}{2} \log\bigl(|M|\bigr) + \frac {1}{2} \mathbf {p}^{\mathrm {T}} M^{-1} \mathbf {p}+ \mathrm{const.} $$

This is the Hamiltonian of the simulated particle, where the potential field is given by −W(f) and the kinetic energy by the quadratic form in p. In order to draw proposals from p(f|y,θ), we can simulate the particle for a certain time interval, introducing an analogue of time t and solving Hamilton’s equations

$$ \frac{d\mathbf {f}}{dt} = \frac{\partial H}{\partial \mathbf {p}} = M^{-1} \mathbf {p},\qquad \frac{d\mathbf {p}}{dt} = - \frac{\partial H}{\partial \mathbf {f}} = \nabla _{\mathbf {f}} W $$

Given that there is no friction, the energy will be conserved during the motion of the particle. Solving Hamilton’s equations directly for general potential fields, however, is analytically intractable, and therefore it is necessary to resort to schemes where time is discretized. The leapfrog integrator discretizes the dynamics in λ steps, also known as leapfrog steps, and is volume preserving and reversible (see Neal 1993 for details). The leapfrog integrator yields an update of \((\mathbf{f}, \mathbf{p})\) into \((\mathbf{f}^{(\lambda)}, \mathbf{p}^{(\lambda)})\). The discretization introduces an approximation such that the total energy is not conserved, so a Metropolis accept/reject step of the form \(\min\{1, \exp(-H(\mathbf{f}^{(\lambda)}, \mathbf{p}^{(\lambda)}) + H(\mathbf{f}, \mathbf{p}))\}\) is needed to ensure that HMC samples from the correct invariant distribution. The HMC transition operator is reported in Algorithm 1.

Algorithm 1 HMC transition operator when \(M = L_{M} L_{M}^{\mathrm {T}}\)
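The leapfrog discretization and the accept/reject step described above can be sketched in NumPy as follows. This is an illustrative sketch rather than a transcription of Algorithm 1: the function names, the fixed number of leapfrog steps, and the toy Gaussian target are assumptions.

```python
import numpy as np

def leapfrog(f, p, grad_W, M_inv, eps, n_steps):
    """Discretize df/dt = M^{-1} p, dp/dt = grad W(f) with n_steps leapfrog steps."""
    f, p = f.copy(), p.copy()
    p += 0.5 * eps * grad_W(f)            # initial half step for the momentum
    for step in range(n_steps):
        f += eps * (M_inv @ p)            # full step for the position
        if step < n_steps - 1:
            p += eps * grad_W(f)          # full momentum step between position steps
    p += 0.5 * eps * grad_W(f)            # final half step for the momentum
    return f, p

def hamiltonian(f, p, W, M_inv):
    """H(f, p) up to constants: potential -W(f) plus kinetic energy."""
    return -W(f) + 0.5 * p @ M_inv @ p

def hmc_step(f, W, grad_W, M, eps, n_steps, rng):
    L_M = np.linalg.cholesky(M)                      # M = L_M L_M^T
    M_inv = np.linalg.inv(M)
    p = L_M @ rng.standard_normal(f.shape[0])        # p ~ N(0, M)
    f_new, p_new = leapfrog(f, p, grad_W, M_inv, eps, n_steps)
    log_accept = hamiltonian(f, p, W, M_inv) - hamiltonian(f_new, p_new, W, M_inv)
    if np.log(rng.uniform()) < log_accept:
        return f_new, True
    return f, False

# Toy usage: Gaussian target with precision A (hypothetical values).
A = np.array([[2.0, 0.5], [0.5, 1.0]])
W = lambda f: -0.5 * f @ A @ f
grad_W = lambda f: -A @ f
M = np.eye(2)
f0, p0 = np.array([1.0, -1.0]), np.array([0.3, 0.2])
f1, p1 = leapfrog(f0, p0, grad_W, np.linalg.inv(M), eps=0.01, n_steps=10)
energy_error = abs(hamiltonian(f1, p1, W, np.linalg.inv(M))
                   - hamiltonian(f0, p0, W, np.linalg.inv(M)))
f_next, accepted = hmc_step(f0, W, grad_W, M, eps=0.1, n_steps=10,
                            rng=np.random.default_rng(0))
```

The small but non-zero `energy_error` is the discretization effect that the Metropolis step corrects for.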

3.3 Manifold MCMC: Simplified Manifold MALA (SMMALA)

Manifold MCMC methods (Girolami and Calderhead 2011) were proposed to have an automatic mechanism to tune parameters in MALA and HMC, and are based on the use of curvature through the Fisher Information (FI) matrix. The FI matrix and the Christoffel symbols are the key quantities in information geometry as they characterize the curvature and the connection on the statistical manifold respectively. Consider a statistical model \(S = \{p(\mathbf{y}|\boldsymbol{\psi}) \mid \boldsymbol{\psi} \in \Psi\}\) where y denotes observed variables and ψ comprises all model parameters. Under conditions that are generally satisfied for most commonly used models (Amari and Nagaoka 2000), S can be considered a \(C^{\infty}\) manifold, and is called a statistical manifold. Let \(\mathcal{L} = \log[p(\mathbf {y}| \boldsymbol {\psi })]\); the FI matrix G of S at ψ is defined as:

$$ G(\boldsymbol {\psi }) = \mathrm {E}_{p(\mathbf {y}| \boldsymbol {\psi })} \bigl[ (\nabla _{\boldsymbol {\psi }} \mathcal{L} ) ( \nabla_{\boldsymbol {\psi }} \mathcal {L} )^{\mathrm {T}} \bigr] = - \mathrm {E}_{p(\mathbf {y}| \boldsymbol {\psi })} [ \nabla _{\boldsymbol {\psi }} \nabla_{\boldsymbol {\psi }} \mathcal{L} ] $$

By definition, the FI matrix is positive semidefinite, and can be considered as the natural metric on S.

In the case of GP models that are hierarchical we need to consider the statistical manifolds associated with the two levels of the hierarchy separately. Let us focus on the statistical manifold associated with the model for y given f. The manifold MALA (MMALA) algorithm (Girolami and Calderhead 2011) defines a Langevin diffusion with stationary distribution p(f|θ,y) on the Riemann manifold of density functions, characterized by a metric tensor denoted as \(G_{\mathbf{f},\mathbf{f}}\). By employing a first order Euler integrator to solve the diffusion, a proposal mechanism with density \(g(\mathbf {f}^{\prime} | \mathbf {f}) = \mathcal {N}(\mathbf {f}^{\prime} | \boldsymbol {\mu }(\mathbf {f},\epsilon), \epsilon^{2} G_{\mathbf {f},\mathbf {f}}^{-1})\) is obtained, where ϵ is the integration step size, a parameter which needs to be tuned, and the dth component of the mean function, \(\boldsymbol{\mu}(\mathbf{f},\epsilon)_d\), is

$$\begin{aligned} \boldsymbol {\mu }(\mathbf {f},\epsilon)_d = & \mathbf {f}_d + \frac{\epsilon ^2}{2} \bigl(G_{\mathbf {f},\mathbf {f}}^{-1} \nabla_{\mathbf {f}} W( \mathbf {f}) \bigr)_d - \epsilon^2 \sum _{i=1}^n\sum_{j=1}^n \bigl(G_{\mathbf {f},\mathbf {f}}^{-1}\bigr)_{i,j} \varGamma_{i,j}^d \end{aligned}$$

where \(\varGamma_{i,j}^{d}\) are the Christoffel symbols of the metric in local coordinates (Amari and Nagaoka 2000). Similarly to MALA (Roberts and Stramer 2002), due to the discretization error introduced by the first order approximation, convergence to the stationary distribution is not guaranteed anymore and thus a standard Metropolis accept/reject step is employed to correct this bias.

In the same spirit, it is possible to extend HMC to define Hamilton’s equations on the statistical manifold. This was proposed and applied in Girolami and Calderhead (2011) and called Riemann manifold Hamiltonian Monte Carlo (RM-HMC). In this work, we will not consider RM-HMC or MMALA, as they both require the derivatives of the FI matrix that would require several expensive operations. Instead, we will consider a simplified version of MMALA (SMMALA), where we assume a manifold with constant curvature, that effectively removes the term depending on the Christoffel symbols, so that the mean of the proposal of SMMALA becomes

$$ \boldsymbol {\mu }_{\mathrm{s}}(\mathbf {f},\epsilon) = \mathbf {f}+ \frac{\epsilon ^2}{2} G_{\mathbf {f},\mathbf {f}}^{-1} \nabla_{\mathbf {f}} W(\mathbf {f}) $$

Furthermore, in the last subsection of this section we will present two variants of HMC that bear some similarities with RM-HMC but are computationally cheaper. The SMMALA transition operator is sketched in Algorithm 2.

Algorithm 2 SMMALA transition operator
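The SMMALA proposal and its Metropolis-Hastings correction can be sketched as follows. This is not a transcription of Algorithm 2: the function name `smmala_step`, the callable `metric`, and the Poisson toy model with log link (for which the metric is taken to be K^{-1} plus diag(exp(f))) are assumptions introduced for this example. Because the metric depends on f, the proposal is not symmetric and both proposal densities enter the acceptance ratio.

```python
import numpy as np

def smmala_step(f, W, grad_W, metric, eps, rng):
    """One SMMALA update: propose from N(mu_s(f, eps), eps^2 G(f)^{-1})."""
    def mean_and_metric(x):
        G = metric(x)
        mu = x + 0.5 * eps**2 * np.linalg.solve(G, grad_W(x))
        return mu, G

    def log_q(x_to, mu, G):
        # log N(x_to | mu, eps^2 G^{-1}) up to a constant that cancels in the ratio
        d = x_to - mu
        _, logdet = np.linalg.slogdet(G)
        return 0.5 * logdet - 0.5 * (d @ G @ d) / eps**2

    mu, G = mean_and_metric(f)
    L_G = np.linalg.cholesky(G)                       # G = L_G L_G^T
    z = rng.standard_normal(f.shape[0])
    f_new = mu + eps * np.linalg.solve(L_G.T, z)      # covariance eps^2 G^{-1}
    mu_new, G_new = mean_and_metric(f_new)
    log_accept = (W(f_new) - W(f)
                  + log_q(f, mu_new, G_new) - log_q(f_new, mu, G))
    if np.log(rng.uniform()) < log_accept:
        return f_new, True
    return f, False

# Toy usage: Poisson counts with a log link and a small GP prior (hypothetical data).
rng = np.random.default_rng(0)
n = 4
x = np.linspace(0.0, 1.0, n)
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / 0.5**2) + 1e-6 * np.eye(n)
K_inv = np.linalg.inv(K)
y = np.array([0.0, 2.0, 1.0, 3.0])
W = lambda f: np.sum(y * f - np.exp(f)) - 0.5 * f @ K_inv @ f
grad_W = lambda f: y - np.exp(f) - K_inv @ f
metric = lambda f: K_inv + np.diag(np.exp(f))

f = np.zeros(n)
n_accepted = 0
for _ in range(200):
    f, accepted = smmala_step(f, W, grad_W, metric, eps=0.5, rng=rng)
    n_accepted += accepted
```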

3.4 Elliptical Slice Sampling (ELL-SS)

Elliptical Slice Sampling (ELL-SS) has been proposed in Murray et al. (2010) to draw samples for f in GP models, and is based on slice sampling (Neal 2003). Due to the fact that latent variables are Gaussian, it is possible to derive this particular version of slice sampling, when constrained on an ellipse. For completeness, we report the transition operator in Algorithm 3 and we refer the reader to Murray et al. (2010) for further details. Note that ELL-SS is quite appealing as it returns a sample which does not need to be accepted or rejected (in fact, a rejection mechanism is implicit within step 5), and the proposal mechanism does not have any free parameters that need tuning.

Algorithm 3 ELL-SS transition operator
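For readers without access to the algorithm listing, the following is a sketch of elliptical slice sampling along the lines of the procedure published in Murray et al. (2010). The step numbering may not match Algorithm 3, and the function name and the logistic toy likelihood are assumptions made for this example.

```python
import numpy as np

def elliptical_slice(f, loglik, L_K, rng):
    """One ELL-SS update for f with prior N(0, K), K = L_K L_K^T."""
    nu = L_K @ rng.standard_normal(f.shape[0])        # auxiliary draw from the prior
    log_y = loglik(f) + np.log(rng.uniform())         # log of the slice height
    theta = rng.uniform(0.0, 2.0 * np.pi)             # initial angle on the ellipse
    theta_min, theta_max = theta - 2.0 * np.pi, theta
    while True:
        f_new = f * np.cos(theta) + nu * np.sin(theta)
        if loglik(f_new) > log_y:
            return f_new                              # point on the slice: always returned
        # shrink the bracket towards theta = 0 (the current state) and retry
        if theta < 0.0:
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)

# Toy usage: logistic likelihood with binary labels (hypothetical data).
rng = np.random.default_rng(0)
n = 5
x = np.linspace(0.0, 1.0, n)
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / 0.3**2) + 1e-6 * np.eye(n)
L_K = np.linalg.cholesky(K)
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
loglik = lambda f: np.sum(y * f - np.logaddexp(0.0, f))

f = np.zeros(n)
for _ in range(100):
    f = elliptical_slice(f, loglik, L_K, rng)
```

The shrinking loop plays the role of the implicit rejection mechanism mentioned in the text, and no step size needs to be tuned.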

3.5 Scaled versions of MH: MH v1 and MH v2

Due to the strong correlation of latent variables imposed by the GP prior, employing an MH operator with an isotropic covariance to sample latent variables leads to extremely poor efficiency. In order to overcome this problem, Neal (1999) proposed two versions of MH that we will denote by MH v1 and MH v2. In MH v1, a set of latent variables z is drawn from the GP prior \(\mathbf {z}\sim \mathcal {N}(\mathbf {z}| \mathbf{0}, K)\), and the proposal is constructed as follows:

$$ \mathbf {f}^{\prime} = \mathbf {f}+ \alpha \mathbf {z}$$

where the parameter α controls the degree of update. In MH v2, instead, the proposal is as follows:

$$ \mathbf {f}^{\prime} = \sqrt{1 - \alpha^2} \mathbf {f}+ \alpha \mathbf {z}$$

In the latter case, given that the proposal satisfies detailed balance with respect to the prior, the acceptance has to be based on the likelihood alone.
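The two proposals differ only in how the current state is combined with the prior draw and in which terms enter the acceptance ratio. A minimal sketch, assuming a user-supplied log-likelihood `loglik` and a Cholesky factor `L_K` of K (both names are illustrative), is:

```python
import numpy as np

def log_prior(f, L_K):
    """log N(f | 0, K) up to a constant, with K = L_K L_K^T."""
    a = np.linalg.solve(L_K, f)
    return -0.5 * a @ a

def mh_v1_step(f, loglik, L_K, alpha, rng):
    """MH v1: f' = f + alpha z. Symmetric proposal, accept on likelihood times prior."""
    z = L_K @ rng.standard_normal(f.shape[0])          # z ~ N(0, K)
    f_new = f + alpha * z
    log_ratio = (loglik(f_new) + log_prior(f_new, L_K)
                 - loglik(f) - log_prior(f, L_K))
    if np.log(rng.uniform()) < log_ratio:
        return f_new, True
    return f, False

def mh_v2_step(f, loglik, L_K, alpha, rng):
    """MH v2: f' = sqrt(1 - alpha^2) f + alpha z. Accept on the likelihood alone."""
    z = L_K @ rng.standard_normal(f.shape[0])          # z ~ N(0, K)
    f_new = np.sqrt(1.0 - alpha**2) * f + alpha * z
    log_ratio = loglik(f_new) - loglik(f)              # prior terms cancel
    if np.log(rng.uniform()) < log_ratio:
        return f_new, True
    return f, False

# Toy usage with a logistic likelihood (hypothetical data).
rng = np.random.default_rng(0)
n = 5
x = np.linspace(0.0, 1.0, n)
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / 0.3**2) + 1e-6 * np.eye(n)
L_K = np.linalg.cholesky(K)
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
loglik = lambda f: np.sum(y * f - np.logaddexp(0.0, f))
f = np.zeros(n)
f_v1, acc_v1 = mh_v1_step(f, loglik, L_K, alpha=0.2, rng=rng)
f_v2, acc_v2 = mh_v2_step(f, loglik, L_K, alpha=0.2, rng=rng)
```

In both cases α trades off the size of the move against the acceptance rate, and it requires tuning.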

3.6 Scaled versions of HMC: HMC v1 and HMC v2

By a similar argument as in MH, it is possible to introduce scaled versions of HMC that reduce the correlation between latent variables. This can be done by setting the mass matrix of HMC according to the precision of the posterior distribution of latent variables. Similarly, from an information geometric perspective, it is sensible to whiten latent variables according to the metric tensor of the statistical manifold. We notice that the metric tensor associated to the model for y given f is \(K^{-1}\) plus a diagonal matrix which is a function of f (see Appendix A for full details). Whitening with respect to that metric tensor would be computationally very expensive for GP models, as it would require the simulation of the Hamiltonian dynamics on a manifold with a position-dependent curvature; this is implemented by RM-HMC which requires the derivatives of the metric tensor as well as implicit leapfrog iterations (Girolami and Calderhead 2011). In order to reduce the computational cost, we propose the following two options: (i) to approximate the diagonal term to be independent of f so that \(M^{-1} = (K^{-1} + C)^{-1} = C^{-1} - C^{-1}(K + C^{-1})^{-1} C^{-1}\) with C diagonal and independent of f; we call this variant HMC v1; (ii) to ignore the diagonal part of the metric tensor and set \(M^{-1} = K\); we call this variant HMC v2. In HMC v1, one simple way to make C independent of f is to compute it for the GP prior mean (which is zero), as proposed, e.g., in Christensen et al. (2005), Vanhatalo and Vehtari (2007).

In both cases, it is possible to employ a standard and computationally efficient HMC proposal that captures part of the curvature of the statistical manifold. This is achieved by introducing a variant of HMC that, rather than using the Cholesky decomposition of the mass matrix, uses the decomposition of its inverse. We report this variant of the HMC transition operator in Algorithm 4.

Algorithm 4 HMC transition operator when \(M^{-1} = L_{M^{-1}} L_{M^{-1}}^{\mathrm {T}}\)

In HMC v1, employing this formulation of HMC is convenient as computing the inverse of M is more stable than computing \(M = K^{-1} + C\), which requires a potentially unstable inversion of K. HMC v1 requires the computation of the inverse of the mass matrix and its factorization each time a new value of θ is proposed. In HMC v2, instead, no extra operations in \(O(n^3)\) are required given that K is already factorized, thus making it computationally very convenient.
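The expression for the inverse mass matrix in HMC v1 is an instance of the matrix inversion lemma, and it can be checked numerically. The sketch below uses a randomly generated, well conditioned K and a positive diagonal C; both are arbitrary choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)                 # symmetric positive definite stand-in for K
c = rng.uniform(0.5, 2.0, size=n)           # positive diagonal entries of C
C = np.diag(c)
C_inv = np.diag(1.0 / c)

# Direct evaluation of M^{-1} = (K^{-1} + C)^{-1}, which inverts K explicitly.
M_inv_direct = np.linalg.inv(np.linalg.inv(K) + C)

# Equivalent form that never inverts K: C^{-1} - C^{-1} (K + C^{-1})^{-1} C^{-1}.
M_inv_stable = C_inv - C_inv @ np.linalg.inv(K + C_inv) @ C_inv

# Factor of the inverse mass matrix, M^{-1} = L L^T, as used by the HMC variant.
L_M_inv = np.linalg.cholesky(M_inv_stable)
```

For HMC v2 the inverse mass matrix is K itself, so its factor is the Cholesky factor of K that is already available.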

4 Results on simulated data

In this section, we first report a study on the efficiency and speed of convergence of different transition operators in sampling from the posterior distribution of individual groups of variables in the SA and AA parameterizations. Secondly, we report the same analysis to compare different parameterizations to obtain samples from the joint posterior distribution of f and θ.

4.1 Experimental setup

We simulated data from the four GP models considered in this work, namely: LRG, LCX, VLT, and ORD. We generated 10 data sets simulating from each of the four models for all combinations of n=100, 400 and d=2, 10, for a total of 160 distinct data sets. In order to isolate the effect of different likelihood functions in the results, we seeded the generation of the input data matrix X, hyper-parameters, and latent variables so that these were the same across different models. Covariates were generated uniformly in the unit hyper-cube, and the parameters used to generate latent variables were σ=exp(2), \(\psi_{\tau_{i}} \sim U[-3, -1]\). We imposed Gamma priors on the length-scale parameters with shape a and rate b, \(p(\tau_i) = \mathrm{Gam}(\tau_i | a=1, b=1)\). On σ we imposed an inverse Gamma prior \(p(\sigma) = \mathrm{invGam}(\sigma | a=1, b=1)\), where a and b are shape and scale parameters respectively, to exploit conjugacy in the SA parameterization.

In all the experiments we collected 20000 samples after a burn-in phase of 5000 iterations; during the burn-in we also had an adaptive phase to allow the samplers to reach recommended acceptance rates (for example around 25% for MH). The transition operators for f had the following tuning parameters: α for MH v1 and MH v2, and ε for SMMALA and the variants of HMC, which used a maximum of 10 leapfrog steps. The transition operators for θ employed the following proposals: MH used a covariance Σ=αI, HMC used a mass matrix M=αI and a maximum of 10 leapfrog steps, and SMMALA used a step-size ε. Convergence analysis was performed using the \(\hat{R}\) potential scale reduction factor (Gelman and Rubin 1992), which is a classic score used to assess convergence of MCMC algorithms. The computation of the \(\hat{R}\) value is based on the within and between chain variances; a value close to one indicates that convergence is reached. The \(\hat{R}\) value was computed based on 10 chains initialized from the prior to study what efficiency can be achieved without running preliminary simulations; this is different from the initialization procedure suggested in Gelman and Rubin (1992) that requires locating the modes of the target density. Due to the fairly diffuse priors on the length-scale parameters, we noticed difficulty in achieving convergence in some cases; we therefore initialized \(\psi_{\tau_{i}}\) randomly in the interval [−3,−1]. The value of \(\hat{R}\) was checked at 1000, 2000, 5000, 10000, 20000 iterations. We use the following procedure to compactly visualize the speed of convergence: we threshold the median value of \(\hat{R}\) across 10 data sets at each checkpoint and report it with a visual coding in which a distinct rectangle marks each range of values, one for \(\hat{R} < 1.1\), one for \(1.1 < \hat{R} < 1.3\), and so on. We then stack the rectangles associated with each checkpoint where we computed the value of \(\hat{R}\), thus producing a sort of histogram of the median of \(\hat{R}\) over the iterations. Efficiency of MCMC methods is compared based on the minimum of the Effective Sample Size (ESS) (Robert and Casella 2005) computed across all the sampled variables. We then report its mean and standard deviation across the 10 chains and the 10 different data sets for each combination of size of the data set, dimensionality, and type of likelihood.
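The two diagnostics described above can be sketched as follows. This is an illustrative sketch, not the implementation used in the experiments: the \(\hat{R}\) function uses the basic ratio of pooled to within-chain variance without the degrees-of-freedom correction of Gelman and Rubin (1992), and the ESS function truncates the sum of autocorrelations at the first non-positive estimate, which is one common choice assumed here because the estimator is not specified in the text. All function names are hypothetical.

```python
import statistics


def potential_scale_reduction(chains):
    """Basic R-hat for one scalar quantity.

    chains: list of equal-length lists, one list of samples per chain.
    """
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    # Within-chain variance: average of the per-chain sample variances.
    W = statistics.fmean([statistics.variance(c) for c in chains])
    # Between-chain variance: n times the variance of the chain means.
    B = n * statistics.variance(chain_means)
    var_plus = (n - 1) / n * W + B / n
    return (var_plus / W) ** 0.5


def effective_sample_size(x):
    """ESS of a single trace, n / (1 + 2 * sum of autocorrelations).

    The sum is truncated at the first non-positive autocorrelation
    (an assumed truncation rule).
    """
    n = len(x)
    mean = statistics.fmean(x)
    centered = [v - mean for v in x]
    c0 = sum(v * v for v in centered) / n
    if c0 == 0.0:
        raise ValueError("constant trace: autocorrelation is undefined")
    rho_sum = 0.0
    for lag in range(1, n):
        c = sum(centered[t] * centered[t + lag] for t in range(n - lag)) / n
        rho = c / c0
        if rho <= 0.0:
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)


def min_ess(traces):
    """Minimum ESS across all sampled variables (one trace per variable)."""
    return min(effective_sample_size(t) for t in traces)
```

In this sketch, \(\hat{R}\) would be evaluated on the samples available at each checkpoint, and `min_ess` on the post-burn-in samples of each chain.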

We are also interested in statistically assessing which methods achieve faster convergence. In order to do so, we perform pairwise Mann-Whitney tests with significance level of 0.05 comparing the value of \(\hat{R}\) at the last checkpoint for all the chains across 10 data sets. This allows us to obtain an ordering of methods in terms of convergence speed. In each table we include a row at the bottom reporting the result of such a test. We denote by 1|2 situations where the method in row 1 of the corresponding table converges significantly faster than the method in row 2. Instead, the notation 1,2 is used when the method in row 1 does not converge significantly faster than the method in row 2.
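A pairwise comparison of this kind can be sketched as below. The sketch is hedged in several ways: it uses a two-sided test with a normal approximation to the distribution of the U statistic, without tie or continuity corrections, whereas the text does not state which variant was used; the function names are hypothetical; and the label "2|1" for the reverse ordering is an assumption made for symmetry.

```python
import math


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test via a normal approximation.

    Returns (U statistic of sample a, p-value). Ties receive midranks;
    no tie or continuity correction is applied to the variance.
    """
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Positions i..j-1 hold tied values with 1-based ranks i+1..j.
        midrank = (i + 1 + j) / 2.0
        rank_sum_a += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(a), len(b)
    u = rank_sum_a - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return u, p


def convergence_order(rhat_row1, rhat_row2, alpha=0.05):
    """Return '1|2' if row 1 has significantly lower R-hat values,
    '2|1' if row 2 does, and '1,2' if the difference is not significant."""
    u, p = mann_whitney_u(rhat_row1, rhat_row2)
    if p >= alpha:
        return "1,2"
    mean_u = len(rhat_row1) * len(rhat_row2) / 2.0
    # A small U for row 1 means its R-hat values tend to be smaller.
    return "1|2" if u < mean_u else "2|1"
```

Here `rhat_row1` and `rhat_row2` would hold the \(\hat{R}\) values at the last checkpoint for the two methods being compared.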

As a measure of complexity, we counted the number of operations with complexity in \(O(n^{3})\), namely the number of Cholesky factorizations of n×n matrices (#C), the number of inversions of n×n matrices (#I), and the number of multiplications of n×n matrices (#M). We believe that this is a more reliable measure of complexity than running time, as running time can be affected by several implementation details and other factors that are not directly related to the actual complexity of the algorithms.

4.2 Assessing the efficiency of samplers for individual groups of variables

In this section, we present an assessment of the efficiency of different transition operators for each group of variables using both SA and AA parameterizations. Computational complexity for all the operators considered in the next sections is summarized in Table 1, where T represents the number of iterations, d the number of covariates and \(\bar{\lambda}\) the average number of leapfrog steps in HMC transition operators. In the following sub-sections we present results about the sampling of the latent variables and hyper-parameters separately.

Table 1 Breakdown of the number of operations in \(O(n^{3})\) required to apply the transition operators considered in this work. #M, #I and #C represent the number of multiplications of n×n matrices, inversions of n×n matrices, and Cholesky decompositions respectively. Counts are reported as functions of the number of iterations T and number of covariates d. In HMC, \(\bar{\lambda}\) denotes the average number of leapfrog steps in one iteration

4.2.1 Sampling f|y,θ

In this section we focus on the sampling from the posterior distribution of the latent variables f. The results can be found in Table 2, and they were obtained by fixing θ to the values used to generate the data. We notice that different likelihood functions heavily affect efficiency and convergence speed; in the examples considered here, the results show that in LRG it is possible to achieve efficiency one order of magnitude higher than in other models. The scaled versions of MH work well in the case of LRG (MH v1 is slightly better than MH v2), but do not offer guarantees of convergence on other models. ELL-SS achieves better efficiency and convergence than the scaled versions of MH. SMMALA, which uses gradient and curvature information, achieves good efficiency and faster convergence than MH v1, MH v2, and ELL-SS, but at the cost of one operation in \(O(n^{3})\) at each iteration, as the metric tensor is a function of f and needs to be factorized at each iteration. Overall, the results suggest that the scaled versions of HMC are the best sampling methods for f|θ,y. HMC v1 is slightly better than HMC v2, but it requires one extra inversion and one extra Cholesky decomposition compared to HMC v2, which does not require any operations in \(O(n^{3})\) once the covariance matrix of the GP is factorized.

Table 2 Comparison of transition operators to sample f|y,θ for data generated from models with four different likelihoods. Minimum ESS is averaged over 10 chains for 10 different data sets for each value of n and d. The last row in each sub-table reports the result of the statistical test to assess which operators achieve significantly faster convergence

4.2.2 SA parameterization—sampling θ|f

In this section we present results about the sampling of hyper-parameters from the posterior distribution θ|f,y which, given the hierarchical structure of the model, is simply θ|f, independent of the data model. As reported in Table 1, the complexity of applying SMMALA and HMC is quite high compared to MH. MH requires one Cholesky factorization of Q at each iteration. In HMC, at each leapfrog step, the gradients of Q with respect to θ are needed and the cheapest way to do this is by inverting Q first and noticing that all the remaining operations are in \(O(n^{2})\); this is done \(\bar{\lambda}\) times on average at every iteration of HMC. Similarly, in SMMALA the gradient can be computed by inverting Q first; by doing so, the metric tensor can then be computed by d multiplications with the derivatives of Q and no other \(O(n^{3})\) operations.
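The operation counts above can be made explicit with the standard expressions for a zero-mean Gaussian, assuming that f|θ is distributed as N(0,Q) with Q the GP covariance matrix; the expressions below are a sketch under that assumption and omit the contribution of the prior on θ, which involves no matrix operations. The gradient of the log-density is

\[
\frac{\partial \log p(f \mid \theta)}{\partial \theta_{i}}
= -\frac{1}{2}\operatorname{Tr}\left(Q^{-1}\frac{\partial Q}{\partial \theta_{i}}\right)
+ \frac{1}{2} f^{\top} Q^{-1}\frac{\partial Q}{\partial \theta_{i}} Q^{-1} f .
\]

Once \(Q^{-1}\) is available, the trace is the sum of the element-wise product of two symmetric n×n matrices, and the quadratic form needs only the vector \(Q^{-1}f\) and one matrix-vector product, so both terms are in \(O(n^{2})\). If the metric tensor is built from the Fisher information of this Gaussian,

\[
G_{ij} = \frac{1}{2}\operatorname{Tr}\left(Q^{-1}\frac{\partial Q}{\partial \theta_{i}}\, Q^{-1}\frac{\partial Q}{\partial \theta_{j}}\right),
\]

then forming the d products \(Q^{-1}\,\partial Q/\partial \theta_{i}\) accounts for the d multiplications, and each entry \(G_{ij}\) is again a trace of a product of two known matrices, computable in \(O(n^{2})\).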

The results are reported in Table 3, and were obtained by fixing f to the value used to generate the data and sampling only the length-scale parameters, as σ can be efficiently sampled using exact Gibbs steps. HMC improves quite substantially on efficiency, but not on speed of convergence; it may be worth employing some rescaling of the hyper-parameters to improve on this, as suggested by Neal (1996). The performance of SMMALA is highly variable in efficiency and it converges more slowly than MH and HMC. This might be due to the skewness of the target distribution, which is known to affect the efficiency of SMMALA (Stathopoulos and Filippone 2011). The results indicate that MH strikes a good balance between efficiency and computational cost.
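The exact Gibbs step for σ follows from the conjugacy of the inverse Gamma prior. As a sketch, assume that σ enters the covariance as a multiplicative factor, \(Q = \sigma \tilde{Q}\), where \(\tilde{Q}\) (a symbol introduced here for illustration) depends only on the length-scale parameters. With the prior invGam(σ|a,b), where a is the shape and b the scale, the conditional distribution of σ given the n latent variables is then

\[
p(\sigma \mid f, \tau) = \mathrm{invGam}\left(\sigma \,\middle|\, a + \frac{n}{2},\; b + \frac{1}{2} f^{\top}\tilde{Q}^{-1} f\right),
\]

which can be sampled directly without an accept/reject step.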

Table 3 Comparison of transition operators to sample θ|f. Minimum ESS is averaged over 10 chains for 10 different data sets for each value of n and d. The last row reports the result of the statistical test to assess which operators achieve significantly faster convergence

4.2.3 AA parameterization—sampling θ|y,ν

In this section we present the sampling of the hyper-parameters from the posterior distribution θ|y,ν, where we fixed ν to the values used to generate the data. The analysis of complexity shows that MH requires one Cholesky factorization at each iteration. In HMC, each leapfrog step requires computing L and the gradient of L with respect to θ and no other operations in \(O(n^{3})\); this can be computed using the differentiation of the Cholesky algorithm, which requires d operations in \(O(n^{3})\) (Smith 1995). Likewise, for SMMALA, L and the d derivatives of L with respect to θ are the only operations in \(O(n^{3})\) needed.

The results can be found in Table 4 and are again variable across different models. In general SMMALA and HMC do not seem to offer faster convergence with respect to the MH transition operator which is therefore competitive in terms of efficiency relative to computational cost.

Table 4 Comparison of transition operators to sample θ|y,ν for data generated from models with four different likelihoods. Minimum ESS is computed as the average over 10 chains for 10 different data sets for each value of n and d. The last row in each sub-table reports the result of the statistical test to assess which operators achieve significantly faster convergence

4.3 Assessing the efficiency of different parameterizations

After analyzing the results in the previous section, we decided to combine the transition operators which achieved a good sampling efficiency with relatively low computational cost and ease of implementation. We decided that a good combination to be used in AA, SA, ASIS, and SURR could be as follows: sampling f using HMC v2 and θ using MH; HMC v2 and MH were adapted during the burn-in phase and in HMC v2 we set the maximum number of leapfrog steps to 10. For the sake of brevity, we focus on the LRG model only; the results on efficiency and speed of convergence in sampling hyper-parameters are reported in Table 5.

Table 5 Comparison of different strategies to sample f,θ|y for data generated from a LRG model. The rightmost column reports the complexity of the different methods with respect to the number of inversions and Cholesky decompositions. In KHR, \(\bar{\kappa}\) represents the average number of iterations to run the Laplace Approximation

It is striking to see how challenging it is to efficiently sample from the posterior distribution of latent variables and hyper-parameters. Sampling efficiency is generally low; this is consistent with our experience in other applications involving sampling in hierarchical models (Filippone et al. 2012a). As expected, the SA parameterization is the worst among the ones we tested. The AA parameterization, ASIS, and SURR generally offer good guarantees of convergence within a few thousand iterations. SURR seems to be superior in efficiency, which is consistent with what is reported in Murray and Adams (2010), but it requires more operations in \(O(n^{3})\) compared to AA and ASIS. ASIS slightly improves efficiency and speed of convergence with respect to the AA scheme but requires double the number of operations in \(O(n^{3})\). KHR seems effective in breaking the correlation between the two groups of variables, but it may require several iterations within the approximation used to sample f. In the experiments considered here \(\bar{\kappa}\) is around 8, so the best compromise between computations and efficiency seems to be given by the AA and ASIS parameterizations.

5 Results on real data

We repeated the comparison of different parameterizations on four UCI data sets (Asuncion and Newman 2007), namely the Pima, Wisconsin, SPECT, and Ionosphere data sets, which we modeled using LRG models; the results are reported in Table 6. We used the same priors and experimental setup as in the previous sections, except that all features were transformed to have zero mean and unit standard deviation, and latent variables were sampled by iterating five updates of HMC v2. Also, chains were initialized by sampling from the prior. Again, the SA parameterization shows the poorest efficiency and convergence speed, and the AA parameterization improves on that. Combining the AA and SA parameterizations using ASIS slightly improves on the AA parameterization, although the improvement is not dramatic. The SURR method improves on the AA parameterization, which is consistent with what is reported in Murray and Adams (2010). The results of KHR are highly variable across data sets; in cases where the approximation to sample latent variables is accurate, the chains mix well. In some cases, however, the approximation is not accurate enough to guarantee a good acceptance rate, and the chains can spend a long time in the same position before accepting the joint proposal.

Table 6 Comparison of different strategies to sample f,θ|y in four UCI data sets modeled using a LRG model

6 Conclusions

In this paper we studied and compared a number of state-of-the-art strategies to carry out the fully Bayesian treatment of GP models. We focused on four GP models and performed an extensive evaluation of efficiency, convergence speed, and computational complexity of several transition operators and sampling strategies.

The results in this paper show that latent variables can be sampled quite efficiently with little computational effort once the GP covariance matrix is factorized. This can be achieved by a simple variant of HMC that we introduced in this paper. Regarding the sampling of hyper-parameters in different parameterizations, the results presented here indicate that the gain in sampling efficiency given by the use of complicated proposal mechanisms does not scale as much as their computational cost. It would be interesting to investigate some recently proposed variants of slice sampling (Thompson and Neal 2010) and Hybrid Monte Carlo (Hoffman and Gelman 2012) on the sampling of hyper-parameters.

The analysis of the results obtained by different parameterizations suggests that AA is a sensible and computationally cheap parameterization with good convergence properties. AA performs similarly to ASIS at half the computational cost. It makes sense, however, to employ ASIS when in doubt about the best parameterization to use, although GP models with full covariance matrices will generally fall into the weak data limit as the \(O(n^{2})\) space and \(O(n^{3})\) time complexities constrain the number of data that can be processed.

In general, the results show how challenging it is to efficiently sample from the posterior distribution of latent variables and hyper-parameters in GP models and motivate further research into methods to do this efficiently. Some sampling strategies, such as the one based on the AA parameterization, are capable of achieving convergence within a reasonable number of iterations, and this makes it possible to carry out the fully Bayesian treatment of GP models dealing with a small to moderate number of samples. We have recently demonstrated that this is indeed the case in Filippone et al. (2012a), but more needs to be done in the direction of developing robust stochastic-based inference methods for GP models.

It would be interesting to investigate how performance is affected by the choice of the design, which in the simulated data presented here was assumed uniform. Also, we studied in particular GP models with the squared exponential ARD covariance function. It would be interesting to compare the methods considered here in models characterized by other covariance functions, such as the Matérn, or sparse inverse covariance functions as in Rue et al. (2009); the latter would make it possible to test the strong data limit case. Finally, in this study we have not included a mean function for the GP prior or extra parameters for the likelihood function. This would require including the sampling of other quantities that may further impact efficiency and speed of convergence.