# A comparative evaluation of stochastic-based inference methods for Gaussian process models


## Abstract

Gaussian Process (GP) models are extensively used in data analysis given their flexible modeling capabilities and interpretability. The fully Bayesian treatment of GP models is analytically intractable, and therefore it is necessary to resort to either deterministic or stochastic approximations. This paper focuses on stochastic-based inference techniques. After discussing the challenges associated with the fully Bayesian treatment of GP models, a number of inference strategies based on Markov chain Monte Carlo methods are presented and rigorously assessed. In particular, strategies based on efficient parameterizations and efficient proposal mechanisms are extensively compared on simulated and real data on the basis of convergence speed, sampling efficiency, and computational cost.

## Keywords

Bayesian inference, Gaussian processes, Markov chain Monte Carlo, Hierarchical models, Latent variable models

## 1 Introduction

Gaussian Process (GP) models represent a class of models that are popular in data analysis due to the associated flexibility and interpretability. Both these features are a direct consequence of their rich parameterization. Flexibility is due to the nonparametric prior over latent variables conditioning observations, whereas interpretability is due to the parameterization of the structure associated with the latent variables. Observations are conditionally independent given a set of jointly Gaussian latent variables, and are assumed to be distributed according to the particular type of data being modeled. The covariance structure of the latent variables is then parameterized by a set of hyper-parameters that characterizes the covariance of the input vectors in terms of length-scales and intensity of interaction. GP models comprise a large set of models, and this paper focuses in particular on Logistic Regression with GP priors (LRG) (Rasmussen and Williams 2006), Log-Gaussian Cox models (LCX) (Møller et al. 1998), Stochastic Volatility models with GP priors (VLT) (Wilson and Ghahramani 2010), and Ordinal Regression with GP priors (ORD) (Chu and Ghahramani 2005).

Exact inference in GP models is analytically intractable. Most of the work to tackle such intractability focuses on deterministic approximations to integrate out latent variables; those approaches include the Laplace Approximation (LA) (Tierney and Kadane 1986), Expectation Propagation (EP) (Minka 2001), and mean field approximations (Opper and Winther 2000) (see, e.g., Rasmussen and Williams 2006 for an extensive presentation of such approximations, and Kuss and Rasmussen 2005 for their assessment on LRG models). Deterministic approximations provide a computationally tractable way to integrate out latent variables, but it is not possible to quantify the error that they introduce in the quantification of uncertainty in predictions (although EP for LRG is reported to be very accurate in Kuss and Rasmussen 2005); also, those methods target the integration of latent variables only.

A fully Bayesian treatment of GP models requires integrating out latent variables as well as hyper-parameters. This is usually done by quadrature methods (Cseke and Heskes 2011; Rue et al. 2009), which limits the number of hyper-parameters that can be employed in GP models.

Based on those considerations, this paper focuses on non-deterministic methods to carry out inference in GP models, and in particular on stochastic approximations based on Markov chain Monte Carlo (MCMC) methods. The use of MCMC-based inference methods is appealing as it provides asymptotic guarantees of convergence to exact inference. In practice, this translates into the possibility of achieving results with the desired level of accuracy (Flegal et al. 2007). Unfortunately, the use of MCMC methods for inference in GP models is extremely difficult. The aim of this paper is to discuss the challenges associated with MCMC-based inference for GP models, and to compare a number of strategies that have been proposed in the literature to tackle them. A preliminary version of this work can be found in Filippone et al. (2012b).^{1}

To the best of our knowledge, this work (i) is the first attempt to extensively assess the state-of-the-art in stochastic-based inference methods for GP models, and (ii) sets the bar for new MCMC methods for inference in GP models. Along with those contributions, this paper presents (iii) a variant of the Hybrid Monte Carlo algorithm that outperforms state-of-the-art methods to sample from the posterior distribution of the latent variables, and (iv) tests the combination of parameterizations, as recently proposed in Yu and Meng (2011), in the case of GP models.

### 1.1 Gaussian Process models

Let *X*={**x** _{1},…,**x** _{ n }} be a set of *n* input vectors described by a set of *d* covariates \(\mathbf {x}_{i} \in \mathbb {R}^{d}\), associated with observed responses **y**={*y* _{1},…,*y* _{ n }}. In GP models, the generative process modeling the observed data **y** given *X* is as follows. Observations are assumed to be conditionally independent given a set of *n* latent variables **f**={*f* _{1},…,*f* _{ n }}, and distributed according to a certain distribution depending on the particular type of data, e.g., Bernoulli for binary labels and Poisson for observations in the form of counts. This can be translated into a likelihood function of the form \(p(\mathbf {y}| \mathbf {f}) = \prod_{i=1}^{n} p(y_{i} | f_{i})\), where for generality the distribution *p*(*y* _{ i }|*f* _{ i }) is left unspecified.

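The latent variables **f** are assumed to be drawn from a zero-mean GP prior with covariance function *k*. The GP prior is a prior over functions, and the covariance structure given by *k* specifies the characteristics of such functions (i.e., degree of smoothness and marginal variance). Let *k* be parameterized by a vector of hyper-parameters \(\boldsymbol {\theta }= (\sigma, \psi_{\tau_{1}}, \ldots, \psi_{\tau_{d}})\), and assume:

\[
k(\mathbf{x}_{i}, \mathbf{x}_{j} | \boldsymbol{\theta}) = \sigma \exp\left[ -\frac{1}{2} \sum_{r=1}^{d} \frac{(\mathbf{x}_{i} - \mathbf{x}_{j})_{r}^{2}}{\exp(\psi_{\tau_{r}})^{2}} \right] \qquad (1)
\]

with \(\psi_{\tau_{r}}\) defining the length-scale of the *r*th covariate and *σ* giving the marginal variance for latent variables. This type of covariance can be used for Automatic Relevance Determination (ARD) (Mackay 1994) of the covariates, as the values \(\tau_{i} = \exp(\psi_{\tau_{i}})\) can be interpreted as length-scale parameters. This definition of covariance function is adopted in many applications and is the one we will consider in the remainder of this paper. Exponentiation of the hyper-parameters is convenient, so that standard MCMC transition operators can be employed for \(\psi_{\tau_{i}}\), thus avoiding dealing with boundary conditions or non-standard MCMC proposals (Robert and Casella 2005). Let *Q* be the matrix whose entries are \(q_{ij} = q(\mathbf{x}_{i}, \mathbf{x}_{j} | \boldsymbol{\psi}_{\tau})\); the covariance matrix *K* will then be \(K = \sigma Q\). The model is fully specified by choosing a prior \(p(\boldsymbol{\theta})\) for the hyper-parameters. The model structure is therefore hierarchical, with hyper-parameters conditioning the latent variables that, in turn, condition observations, so that \(p(\mathbf{y}, \mathbf{f}, \boldsymbol{\theta}) = p(\mathbf{y} | \mathbf{f}) \, p(\mathbf{f} | \boldsymbol{\theta}) \, p(\boldsymbol{\theta})\).

In a fully Bayesian treatment, the predictive distribution for a new input vector \(\mathbf{x}_{*}\) can be written in the following way (for the sake of clarity we drop the explicit conditioning on *X* and \(\mathbf{x}_{*}\)):

\[
p(y_{*} | \mathbf{y}) = \int p(y_{*} | f_{*}) \, p(f_{*} | \mathbf{f}, \boldsymbol{\theta}) \, p(\mathbf{f}, \boldsymbol{\theta} | \mathbf{y}) \, df_{*} \, d\mathbf{f} \, d\boldsymbol{\theta} \qquad (2)
\]

which gives the predictive distribution of \(y_{*}\) given the GP modeling assumption.

MCMC methods yield samples from the posterior distribution of **f** and \(\boldsymbol{\theta}\), so that we can obtain a Monte Carlo estimate of the predictive distribution as follows:

\[
p(y_{*} | \mathbf{y}) \simeq \frac{1}{N} \sum_{i=1}^{N} \int p(y_{*} | f_{*}) \, p(f_{*} | \mathbf{f}^{(i)}, \boldsymbol{\theta}^{(i)}) \, df_{*} \qquad (3)
\]

where *N* denotes the number of samples used to compute the estimate. In Eq. (3) we denoted the *i*th samples from the posterior distribution of **f** and \(\boldsymbol{\theta}\) obtained by means of MCMC methods by \(\mathbf{f}^{(i)}\) and \(\boldsymbol{\theta}^{(i)}\). Note that the remaining integral is univariate and it is generally easy to evaluate.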
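As an illustration of the Monte Carlo estimate in Eq. (3), the following sketch shows how the univariate integral could be evaluated for each posterior sample. It rests on several assumptions that are not taken from the text: a Bernoulli likelihood with logistic link (as in LRG), the common squared-exponential ARD form for the covariance, and Gauss-Hermite quadrature for the integral over \(f_{*}\). The names `ard_kernel` and `predictive_prob` and the `jitter` value are hypothetical.

```python
import numpy as np

def ard_kernel(X1, X2, sigma, psi_tau):
    """Assumed ARD covariance: sigma * exp(-0.5 * sum_r diff_r^2 / tau_r^2)."""
    tau = np.exp(psi_tau)                          # length-scales tau_r = exp(psi_tau_r)
    diff = X1[:, None, :] / tau - X2[None, :, :] / tau
    return sigma * np.exp(-0.5 * np.sum(diff ** 2, axis=2))

def predictive_prob(x_star, X, f_samples, theta_samples, n_quad=20, jitter=1e-8):
    """Monte Carlo estimate of p(y_* = 1 | y) from samples (f^(i), theta^(i))."""
    # Nodes and weights for integrals against exp(-t^2)
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    total = 0.0
    for f, (sigma, psi_tau) in zip(f_samples, theta_samples):
        K = ard_kernel(X, X, sigma, psi_tau) + jitter * np.eye(len(X))
        k_star = ard_kernel(X, x_star[None, :], sigma, psi_tau)[:, 0]
        # Gaussian conditional p(f_* | f, theta) implied by the GP prior
        a = np.linalg.solve(K, k_star)
        mean = a @ f
        var = max(sigma - k_star @ a, 0.0)
        # Univariate integral of the logistic function under N(mean, var)
        f_star = mean + np.sqrt(2.0 * var) * nodes
        total += np.sum(weights / (1.0 + np.exp(-f_star))) / np.sqrt(np.pi)
    return total / len(f_samples)
```

In a real run, `f_samples` and `theta_samples` would be the MCMC draws; each entry of `theta_samples` is assumed here to be a pair `(sigma, psi_tau)`.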

### 1.2 Challenges in MCMC based inference for GP models

Sampling from the posterior of latent variables and hyper-parameters by joint proposals is not feasible; it is extremely unlikely to propose a set of latent variables and hyper-parameters that are compatible with each other and observed data. This forces one to consider schemes such as Gibbs sampling, where groups of variables are updated one at a time, leading to the following challenges:

(i) Due to the hierarchical structure of GP models, chains converge slowly and mix poorly if the coupling effect between the groups of variables is not dealt with properly. This requires some form of reparameterization or clever proposal mechanism that efficiently decouples the dependencies between the groups of variables. This effect has drawn a lot of attention in the case of hierarchical models in general (Yu and Meng 2011), and recently in GP models (Knorr-Held and Rue 2002; Murray and Adams 2010). In Knorr-Held and Rue (2002) a joint update of latent variables and hyper-parameters is proposed with the aim of avoiding proposals for hyper-parameters to be conditioned on the values of latent variables. In Murray and Adams (2010) a parameterization based on auxiliary data is proposed that aims at reducing the coupling between the two groups of variables. Other ideas involve the use of reparameterizations based on whitening the latent variables; in the terminology of Yu and Meng (2011), this corresponds to employing the so called Ancillary Augmentation (AA) parameterization. Recently, Yu and Meng (2011) proposed to interweave parameterizations characterized by complementary features in order to boost sampling efficiency. Parameterizations can be complementary in the sense that they offer better performance in either strong or weak data limits; the idea of combining parameterizations is to achieve high sampling efficiency in both strong and weak data scenarios. We are interested in comparing the methods in Knorr-Held and Rue (2002), Murray and Adams (2010) and Yu and Meng (2011) applied to GP models. Another possibility would be to approximately integrate out latent variables and obtain samples from the corresponding approximate posterior of hyper-parameters. 
For GP classification this might be a sensible thing to do, as the Expectation Propagation approximation has been reported to be very accurate (Kuss and Rasmussen 2005); however, this is peculiar to GP classification and for general GP models it may not be the case.

(ii) Sampling hyper-parameters and latent variables cannot be done using exact Gibbs steps; it requires proposals that are accepted or rejected based on a Hastings ratio, leading to a waste of expensive computations. Transition operators characterized by acceptance mechanisms embedded in a Gibbs sampler are usually referred to as Metropolis-within-Gibbs operators. Designing proposals that guarantee high acceptance and independence between samples is extremely challenging, especially because latent variables can have dimensions on the order of hundreds or thousands. We will compare several transition operators, for different steps of the Gibbs sampler, with the aim of gaining insights about ways to strike a good balance between efficiency and computational cost. We will consider transition operators characterized by proposal mechanisms with increasing complexity, and in particular the Metropolis-Hastings (MH) operator which is based on random walk types of proposals, Hybrid Monte Carlo (HMC) which uses the gradient of the log-density of interest, and manifold methods (Girolami and Calderhead 2011) which use curvature information (i.e., second derivatives of the log-density).

The paper is organized as follows. Sections 2 and 3 report the parameterization strategies and the transition operators considered in this work. Sections 4 and 5 report an extensive comparison of those strategies and transition operators, on simulated and real data, on the basis of efficiency, convergence speed and computational complexity; Sect. 6 concludes the paper. For the sake of readability, most of the technical derivations can be found in the appendices.

## 2 Dealing with the hierarchical structure of GP models

### 2.1 Sufficient and ancillary augmentation

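The hierarchical formulation \(p(\mathbf{y} | \mathbf{f}) \, p(\mathbf{f} | \boldsymbol{\theta}) \, p(\boldsymbol{\theta})\) corresponds to the Sufficient Augmentation (SA) parameterization in the terminology of Yu and Meng (2011). The AA parameterization is obtained by decomposing *Q* into the product of two factors \(LL^{\mathrm{T}}\), and viewing the generation of the latent variables as \(\mathbf {f}= \sqrt{\sigma} L \boldsymbol {\nu }\) with \(\boldsymbol {\nu }\sim \mathcal {N}(\boldsymbol {\nu }| \mathbf {0}, I)\), which implies that \(\mathbf {f}\sim \mathcal {N}(\mathbf {f}| \mathbf{0}, K)\). In the remainder of this paper, we will consider *L* to be the lower triangular Cholesky factor of *Q*, but in principle any square root of *Q* could be used. In this way, \(\boldsymbol{\nu}\) is ancillary for \(\boldsymbol{\theta}\) and it is possible to express the joint density as

\[
p(\mathbf{y}, \boldsymbol{\nu}, \boldsymbol{\theta}) = p(\mathbf{y} | \boldsymbol{\nu}, \boldsymbol{\theta}) \, p(\boldsymbol{\nu}) \, p(\boldsymbol{\theta}).
\]

### 2.2 Ancillarity-Sufficiency Interweaving Strategy (ASIS)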

Yu and Meng (2011) show that the convergence rate *r* of the scheme when the parameterizations are interweaved is related to the rates of the two schemes \(r_{1}\) and \(r_{2}\) by \(r \leq R_{1,2} \sqrt{r_{1} r_{2}}\), where \(R_{1,2}\) is the maximal correlation between the latent variables for the two schemes. Given that the former expression implies \(r \leq \max(r_{1}, r_{2})\), combining the two parameterizations leads to a scheme that is better than the worst. This is already an advantage compared to using a single scheme when one is in doubt on which scheme to use. However, the key result is the fact that \(R_{1,2}\) can be very small depending on the two parameterizations, so it is possible to make the combined scheme converge quickly even if neither of the individual schemes does. In general, this result is quite remarkable, as once different reparameterizations are available, combining them using the interweaving strategy is simple to implement, and can dramatically boost sampling efficiency. In GP models, the ASIS scheme amounts to interweaving SA and AA updates, following Yu and Meng (2011).
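The interweaving of SA and AA updates can be sketched as follows. This is one plausible reading of the strategy under stated assumptions, not the authors' implementation: the hyper-parameters are stored as a pair `(sigma, psi_tau)`, `chol_Q` is a hypothetical callable returning the Cholesky factor *L* of *Q*, and the three `update_*` callables stand for any transition operators that leave the corresponding conditional distributions invariant.

```python
import numpy as np

def to_ancillary(f, sigma, L):
    """Map SA variables to AA variables: nu = L^{-1} f / sqrt(sigma), with Q = L L^T."""
    return np.linalg.solve(L, f) / np.sqrt(sigma)

def to_sufficient(nu, sigma, L):
    """Map AA variables back to SA variables: f = sqrt(sigma) L nu."""
    return np.sqrt(sigma) * (L @ nu)

def asis_sweep(f, theta, update_f, update_theta_sa, update_theta_aa, chol_Q):
    """One interweaved sweep over latent variables and hyper-parameters."""
    f = update_f(f, theta)                           # f | y, theta
    theta = update_theta_sa(theta, f)                # SA step: theta | f
    sigma, psi_tau = theta
    nu = to_ancillary(f, sigma, chol_Q(psi_tau))     # switch to whitened variables
    theta = update_theta_aa(theta, nu)               # AA step: theta | y, nu
    sigma, psi_tau = theta
    f = to_sufficient(nu, sigma, chol_Q(psi_tau))    # map back with the new theta
    return f, theta
```

Note that the final mapping changes **f** whenever the AA step changes the hyper-parameters, because \(\boldsymbol{\nu}\) is held fixed rather than **f**.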

### 2.3 Knorr-Held and Rue (KHR)

The idea underpinning KHR is to jointly sample hyper-parameters and latent variables as follows. Firstly, a set of hyper-parameters \(\boldsymbol{\theta}' | \boldsymbol{\theta}\) is proposed, and secondly a set of latent variables conditioned on the new set of hyper-parameters, namely \(\mathbf{f}' | \mathbf{y}, \boldsymbol{\theta}'\), is proposed. The proposal \((\boldsymbol{\theta}', \mathbf{f}')\) is then jointly accepted or rejected according to a standard Hastings ratio. The key idea is to prevent the acceptance of the proposal \(\boldsymbol{\theta}'\) from depending on **f**, thus avoiding the strong coupling effect due to the hierarchical nature of the model. KHR was proposed in applications making use of Gaussian Markov Random Fields, and we will discuss the application of this idea for GP models in the section reporting the experiments. In order to avoid difficulties in devising a proposal for sampling from \(\mathbf{f}' | \mathbf{y}, \boldsymbol{\theta}'\), here we set the proposal as the Gaussian obtained by constructing a Laplace approximation to \(p(\mathbf{f} | \mathbf{y}, \boldsymbol{\theta}')\).

### 2.4 Surrogate method (SURR)

In the SURR method (Murray and Adams 2010), a set of auxiliary latent variables **g** is introduced as a noisy version of **f**; in particular, \(p(\mathbf {g}| \mathbf {f}, \boldsymbol {\theta }) = \mathcal {N}(\mathbf {g}| \mathbf {f}, S_{\boldsymbol {\theta }})\). This construction yields a conditional distribution for **f** of the form \(p(\mathbf {f}| \mathbf {g}, \boldsymbol {\theta }) = \mathcal {N}(\mathbf {f}| \mathbf {m}, R)\), with \(R = S_{\boldsymbol{\theta}} - S_{\boldsymbol{\theta}} (S_{\boldsymbol{\theta}} + K)^{-1} S_{\boldsymbol{\theta}}\) and \(\mathbf {m}= R S_{\boldsymbol {\theta }}^{-1} \mathbf {g}\). After decomposing \(R = DD^{\mathrm{T}}\), the sampling of \(\boldsymbol{\theta}\) is then conditioned on the variables \(\boldsymbol{\eta}\) defined by \(\mathbf{f} = D \boldsymbol{\eta} + \mathbf{m}\). The covariance \(S_{\boldsymbol{\theta}}\) is constructed to be diagonal, with elements obtained by matching the posterior for each latent variable individually or by Taylor approximations (see Murray and Adams 2010 for details).
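The change of variables used by SURR can be written down directly from the expressions for *R* and **m**. The sketch below is illustrative only: the function name `surrogate_whitening` is hypothetical, the diagonal of \(S_{\boldsymbol{\theta}}\) is passed in as a vector rather than constructed as in Murray and Adams (2010), and the Cholesky factor is used as one possible choice of *D*.

```python
import numpy as np

def surrogate_whitening(K, S_diag, g, f):
    """Return (eta, m, D) such that f = D @ eta + m."""
    S = np.diag(S_diag)
    R = S - S @ np.linalg.solve(S + K, S)     # R = S - S (S + K)^{-1} S
    R = 0.5 * (R + R.T)                       # remove round-off asymmetry
    m = R @ (g / S_diag)                      # m = R S^{-1} g
    D = np.linalg.cholesky(R)                 # one factor with R = D D^T
    eta = np.linalg.solve(D, f - m)           # whitened variables
    return eta, m, D
```

Hyper-parameters are then updated with \(\boldsymbol{\eta}\) and **g** held fixed, and **f** is recomputed from \(\mathbf{f} = D \boldsymbol{\eta} + \mathbf{m}\) with the proposed values.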

## 3 MCMC transition operators considered in this work

This section presents the transition operators considered in this work. We are interested in understanding whether and to what extent employing proposal mechanisms making use of gradient or curvature information of the target density improves sampling efficiency and speed of convergence with respect to computational complexity. We therefore consider transition operators with increasing complexity, and in particular the Metropolis-Hastings (MH) operator which is based on random walk types of proposals, the Hybrid Monte Carlo (HMC) operator which uses gradient information, and the Simplified Manifold Metropolis Adjusted Langevin Algorithm (SMMALA) operator which is one of the simplest manifold MCMC methods proposed in Girolami and Calderhead (2011) using curvature information.

For the sake of clarity, we will focus on the transition operators for **f**, but the same operators can be easily applied to \(\boldsymbol{\theta}\). We will first present MH, HMC, and SMMALA, and we will then discuss Elliptical Slice Sampling and a few variants of MH and HMC that have been specifically proposed for sampling **f**, and do not have counterparts for \(\boldsymbol{\theta}\). In the case of latent variables, the operators aim to leave the posterior \(p(\mathbf{f} | \mathbf{y}, \boldsymbol{\theta})\) invariant; in the remainder of this work, \(W(\mathbf{f})\) is defined as \(\log[p(\mathbf{y} | \mathbf{f}) \, p(\mathbf{f} | \boldsymbol{\theta})]\), which equals the log of the desired target density up to constants. In the case of hyper-parameters we can define the invariant distribution according to the chosen parameterization and apply the operators presented here for sampling \(\boldsymbol{\theta}\) rather than **f**.

### 3.1 Metropolis-Hastings (MH)

The Metropolis-Hastings transition operator employs a proposal mechanism *g*(**f**′|**f**) based on a random walk (Robert and Casella 2005). A common choice is to use a multivariate Gaussian proposal with covariance *Σ* centered at the former position **f**, thus taking the form \(g(\mathbf {f}^{\prime}|\mathbf {f}) = \mathcal{N}(\mathbf {f}^{\prime} | \mathbf {f},\varSigma)\). For such a symmetric proposal mechanism, **f**′ is then accepted with probability min{1,exp(*W*(**f**′)−*W*(**f**))}.
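A minimal sketch of this operator is given below, assuming that the target is supplied as a callable `W` returning the log-density up to a constant and that the proposal covariance is passed through its Cholesky factor `chol_Sigma`; both names are illustrative.

```python
import numpy as np

def mh_step(f, W, chol_Sigma, rng):
    """One random-walk MH update; returns (state, accepted)."""
    # Symmetric Gaussian proposal centered at the current state
    f_prop = f + chol_Sigma @ rng.standard_normal(f.shape[0])
    # Accept with probability min{1, exp(W(f') - W(f))}
    if np.log(rng.uniform()) < W(f_prop) - W(f):
        return f_prop, True
    return f, False
```

Working on the log scale avoids overflow when \(W(\mathbf{f}') - W(\mathbf{f})\) is large.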

### 3.2 Hybrid Monte Carlo (HMC)

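In HMC, an auxiliary variable **p**, that plays the role of a momentum variable, is drawn from \(\mathcal {N}(\mathbf {p}| \mathbf {0}, M)\), where the covariance matrix *M* is the so-called mass matrix. The joint density of **f** and **p** factorizes as \(p(\mathbf{f}, \mathbf{p}) = \exp(W(\mathbf{f})) \, p(\mathbf{p})\), and the negative log-joint density reads

\[
H(\mathbf{f}, \mathbf{p}) = -W(\mathbf{f}) + \frac{1}{2} \mathbf{p}^{\mathrm{T}} M^{-1} \mathbf{p} + \mathrm{const.}
\]

This is the Hamiltonian of a particle whose potential energy is given by \(-W(\mathbf{f})\) and the kinetic energy by the quadratic form in **p**. In order to draw proposals from \(p(\mathbf{f} | \mathbf{y}, \boldsymbol{\theta})\), we can simulate the particle for a certain time interval, introducing an analogue of time *t* and solving Hamilton's equations

\[
\frac{d\mathbf{f}}{dt} = \frac{\partial H}{\partial \mathbf{p}} = M^{-1} \mathbf{p}, \qquad \frac{d\mathbf{p}}{dt} = -\frac{\partial H}{\partial \mathbf{f}} = \nabla_{\mathbf{f}} W(\mathbf{f}).
\]

In practice, the dynamics are discretized with the leapfrog integrator, which is run for *λ* steps, also known as leapfrog steps, and is volume preserving and reversible (see Neal 1993 for details). The leapfrog integrator yields an update of \((\mathbf{f}, \mathbf{p})\) into \((\mathbf{f}_{(\lambda)}, \mathbf{p}_{(\lambda)})\). The discretization introduces an approximation such that the total energy is not conserved, so a Metropolis accept/reject step of the form \(\min\{1, \exp(-H(\mathbf{f}_{(\lambda)}, \mathbf{p}_{(\lambda)}) + H(\mathbf{f}, \mathbf{p}))\}\) is needed to ensure that HMC samples from the correct invariant distribution. The HMC transition operator is reported in Algorithm 1.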
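The following sketch shows one way to implement an HMC transition with a leapfrog integrator. It is a generic textbook version under stated assumptions, not a transcription of Algorithm 1: `W` and `grad_W` are user-supplied callables for the log-target and its gradient, `eps` is the leapfrog step size, and the number of leapfrog steps is fixed rather than randomized.

```python
import numpy as np

def leapfrog(f, p, grad_W, eps, n_steps, M_inv):
    """Leapfrog integration for H(f, p) = -W(f) + 0.5 p^T M^{-1} p."""
    f, p = f.copy(), p.copy()
    p = p + 0.5 * eps * grad_W(f)                  # initial half step for the momentum
    for step in range(n_steps):
        f = f + eps * (M_inv @ p)                  # full step for the position
        # Full momentum step, except for a half step at the very end
        scale = eps if step < n_steps - 1 else 0.5 * eps
        p = p + scale * grad_W(f)
    return f, p

def hmc_step(f, W, grad_W, eps, n_steps, M, rng):
    """One HMC transition; returns (state, accepted)."""
    M_inv = np.linalg.inv(M)
    p = np.linalg.cholesky(M) @ rng.standard_normal(f.shape[0])   # p ~ N(0, M)
    f_new, p_new = leapfrog(f, p, grad_W, eps, n_steps, M_inv)
    H_old = -W(f) + 0.5 * p @ M_inv @ p
    H_new = -W(f_new) + 0.5 * p_new @ M_inv @ p_new
    # Metropolis correction for the discretization error
    if np.log(rng.uniform()) < H_old - H_new:
        return f_new, True
    return f, False
```

Reversibility can be checked numerically: integrating forward, negating the momentum, and integrating again should return the starting point.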

### 3.3 Manifold MCMC: Simplified Manifold MALA (SMMALA)

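Manifold MCMC methods build on concepts from information geometry. Consider a statistical model \(S = \{p(\mathbf{y} | \boldsymbol{\psi}) \,|\, \boldsymbol{\psi} \in \Psi\}\) where **y** denotes observed variables and \(\boldsymbol{\psi}\) comprises all model parameters. Under conditions that are generally satisfied for most commonly used models (Amari and Nagaoka 2000), *S* can be considered a \(C^{\infty}\) manifold, and is called statistical manifold. Let \(\mathcal{L} = \log[p(\mathbf {y}| \boldsymbol {\psi })]\); the Fisher Information (FI) matrix *G* of *S* at \(\boldsymbol{\psi}\) is defined as:

\[
G = \mathrm{E}_{p(\mathbf{y} | \boldsymbol{\psi})}\left[ (\nabla_{\boldsymbol{\psi}} \mathcal{L}) (\nabla_{\boldsymbol{\psi}} \mathcal{L})^{\mathrm{T}} \right] = -\mathrm{E}_{p(\mathbf{y} | \boldsymbol{\psi})}\left[ \nabla_{\boldsymbol{\psi}} \nabla_{\boldsymbol{\psi}} \mathcal{L} \right],
\]

and it provides a natural metric on *S*.

For the latent variables, the relevant statistical model is the one for **y** given **f**. The manifold MALA (MMALA) algorithm (Girolami and Calderhead 2011) defines a Langevin diffusion with stationary distribution \(p(\mathbf{f} | \boldsymbol{\theta}, \mathbf{y})\) on the Riemann manifold of density functions, characterized by a metric tensor denoted as \(G_{\mathbf{f},\mathbf{f}}\). By employing a first order Euler integrator to solve the diffusion, a proposal mechanism with density \(g(\mathbf {f}^{\prime} | \mathbf {f}) = \mathcal {N}(\mathbf {f}^{\prime} | \boldsymbol {\mu }(\mathbf {f},\epsilon), \epsilon^{2} G_{\mathbf {f},\mathbf {f}}^{-1})\) is obtained, where *ϵ* is the integration step size, a parameter which needs to be tuned, and the *d*th component of the mean function \(\boldsymbol{\mu}(\mathbf{f}, \epsilon)_{d}\) is

\[
\boldsymbol{\mu}(\mathbf{f}, \epsilon)_{d} = f_{d} + \frac{\epsilon^{2}}{2} \left( G_{\mathbf{f},\mathbf{f}}^{-1} \nabla_{\mathbf{f}} W(\mathbf{f}) \right)_{d} - \epsilon^{2} \sum_{j} \left( G_{\mathbf{f},\mathbf{f}}^{-1} \frac{\partial G_{\mathbf{f},\mathbf{f}}}{\partial f_{j}} G_{\mathbf{f},\mathbf{f}}^{-1} \right)_{dj} + \frac{\epsilon^{2}}{2} \sum_{j} \left( G_{\mathbf{f},\mathbf{f}}^{-1} \right)_{dj} \mathrm{Tr}\left( G_{\mathbf{f},\mathbf{f}}^{-1} \frac{\partial G_{\mathbf{f},\mathbf{f}}}{\partial f_{j}} \right).
\]

SMMALA simplifies the proposal by dropping the terms involving derivatives of the metric tensor, so that \(\boldsymbol{\mu}(\mathbf{f}, \epsilon) = \mathbf{f} + \frac{\epsilon^{2}}{2} G_{\mathbf{f},\mathbf{f}}^{-1} \nabla_{\mathbf{f}} W(\mathbf{f})\).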
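To make the proposal mechanism concrete, the sketch below implements a simplified manifold MALA step for latent variables. It relies on assumptions that go beyond the text: a Bernoulli likelihood with logistic link, for which the metric tensor is taken to be \(K^{-1}\) plus the diagonal matrix with entries \(\pi_{i}(1 - \pi_{i})\); the simplified mean \(\mathbf{f} + \frac{\epsilon^{2}}{2} G^{-1} \nabla W\), which drops the terms involving derivatives of the metric; and an explicit `K_inv` argument, used only for brevity. All function names are hypothetical.

```python
import numpy as np

def smmala_pieces(f, y, K_inv, eps):
    """Return W(f), the metric G(f), and the simplified proposal mean."""
    pi = 1.0 / (1.0 + np.exp(-f))
    W = np.sum(y * f - np.logaddexp(0.0, f)) - 0.5 * f @ K_inv @ f
    grad = y - pi - K_inv @ f                      # gradient of W
    G = K_inv + np.diag(pi * (1.0 - pi))           # assumed metric tensor
    mu = f + 0.5 * eps ** 2 * np.linalg.solve(G, grad)
    return W, G, mu

def log_proposal(x, mu, G, eps):
    """log N(x | mu, eps^2 G^{-1}), dropping terms that cancel in the ratio."""
    d = x - mu
    return 0.5 * np.linalg.slogdet(G)[1] - 0.5 * (d @ G @ d) / eps ** 2

def smmala_step(f, y, K_inv, eps, rng):
    """One simplified manifold MALA transition; returns (state, accepted)."""
    W, G, mu = smmala_pieces(f, y, K_inv, eps)
    L = np.linalg.cholesky(G)                      # factorization needed at every step
    # L^{-T} z has covariance G^{-1} when z ~ N(0, I)
    f_prop = mu + eps * np.linalg.solve(L.T, rng.standard_normal(f.shape[0]))
    W_p, G_p, mu_p = smmala_pieces(f_prop, y, K_inv, eps)
    # The proposal is not symmetric, so both directions enter the Hastings ratio
    log_ratio = (W_p - W
                 + log_proposal(f, mu_p, G_p, eps)
                 - log_proposal(f_prop, mu, G, eps))
    if np.log(rng.uniform()) < log_ratio:
        return f_prop, True
    return f, False
```

Because the metric depends on **f**, its factorization cannot be reused across iterations.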

### 3.4 Elliptical Slice Sampling (ELL-SS)

Elliptical Slice Sampling (ELL-SS) (Murray et al. 2010) has been proposed to sample **f** in GP models, and is based on slice sampling (Neal 2003). Due to the fact that latent variables are Gaussian, it is possible to derive this particular version of slice sampling, in which proposals are constrained to an ellipse. For completeness, we report the transition operator in Algorithm 3 and we refer the reader to Murray et al. (2010) for further details. Note that ELL-SS is quite appealing as it returns a sample which does not need to be accepted or rejected (in fact, a rejection mechanism is implicit within step 5), and the proposal mechanism does not have any free parameters that need tuning.
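A compact version of the elliptical slice sampling update, following the description in Murray et al. (2010), is sketched below. The step numbering of Algorithm 3 is not reproduced; `log_lik` is assumed to be a callable returning \(\log p(\mathbf{y} | \mathbf{f})\), and `chol_K` the Cholesky factor of the prior covariance.

```python
import numpy as np

def elliptical_slice(f, log_lik, chol_K, rng):
    """One elliptical slice sampling update for a zero-mean Gaussian prior N(0, K)."""
    nu = chol_K @ rng.standard_normal(f.shape[0])      # auxiliary draw from the prior
    log_y = log_lik(f) + np.log(rng.uniform())         # slice level
    theta = rng.uniform(0.0, 2.0 * np.pi)              # initial angle on the ellipse
    theta_min, theta_max = theta - 2.0 * np.pi, theta
    while True:
        f_new = f * np.cos(theta) + nu * np.sin(theta)
        if log_lik(f_new) > log_y:
            return f_new                               # point on the slice: done
        # Otherwise shrink the bracket towards theta = 0 (the current state)
        if theta < 0.0:
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)
```

The loop always terminates because the bracket shrinks towards the current state, which lies on the slice by construction.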

### 3.5 Scaled versions of MH: MH v1 and MH v2

In MH v1, a vector **z** is drawn from the GP prior \(\mathbf {z}\sim \mathcal {N}(\mathbf {z}| 0, K)\), and the proposal is constructed as follows:

where *α* controls the degree of update. In MH v2, instead, the proposal is as follows:

### 3.6 Scaled versions of HMC: HMC v1 and HMC v2

By a similar argument as in MH, it is possible to introduce scaled versions of HMC that reduce the correlation between latent variables. This can be done by setting the mass matrix of HMC according to the precision of the posterior distribution of latent variables. Similarly, from an information geometric perspective, it is sensible to whiten latent variables according to the metric tensor of the statistical manifold. We notice that the metric tensor associated to the model for **y** given **f** is *K* ^{−1} plus a diagonal matrix which is a function of **f** (see Appendix A for full details). Whitening with respect to that metric tensor would be computationally very expensive for GP models, as it would require the simulation of the Hamiltonian dynamics on a manifold with a position-dependent curvature; this is implemented by RM-HMC which requires the derivatives of the metric tensor as well as implicit leapfrog iterations (Girolami and Calderhead 2011). In order to reduce the computational cost, we propose the following two options: (i) to approximate the diagonal term to be independent of **f** so that *M* ^{−1}=(*K* ^{−1}+*C*)^{−1}=*C* ^{−1}−*C* ^{−1}(*K*+*C* ^{−1})^{−1} *C* ^{−1} with *C* diagonal and independent of **f**; we call this variant HMC v1; (ii) to ignore the diagonal part of the metric tensor and set *M* ^{−1}=*K*; we call this variant HMC v2. In HMC v1, one simple way to make *C* independent of **f** is to compute it for the GP prior mean (which is zero), as proposed, e.g., in Christensen et al. (2005), Vanhatalo and Vehtari (2007).

In HMC v1, employing this formulation of HMC is convenient as computing the inverse of *M* is more stable than computing \(M = K^{-1} + C\), which requires a potentially unstable inversion of *K*. HMC v1 requires the computation of the inverse of the mass matrix and its factorization each time a new value of \(\boldsymbol{\theta}\) is proposed. In HMC v2, instead, no extra operations in \(O(n^{3})\) are required given that *K* is already factorized, thus making it computationally very convenient.
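The expression for the inverse mass matrix in HMC v1 can be checked numerically against the direct computation. In this sketch the function name is hypothetical, and the diagonal of *C* is passed in as a vector; the value 0.25 used in the check is what a logistic likelihood would give at the prior mean, which is an assumption made only for the example.

```python
import numpy as np

def hmc_v1_inverse_mass(K, C_diag):
    """M^{-1} = C^{-1} - C^{-1} (K + C^{-1})^{-1} C^{-1}, without inverting K."""
    C_inv = np.diag(1.0 / C_diag)
    return C_inv - C_inv @ np.linalg.solve(K + C_inv, C_inv)

# Small check against the direct, less stable computation (K^{-1} + C)^{-1}
K = np.array([[2.0, 0.5], [0.5, 1.0]])
C_diag = np.array([0.25, 0.25])
M_inv_stable = hmc_v1_inverse_mass(K, C_diag)
M_inv_direct = np.linalg.inv(np.linalg.inv(K) + np.diag(C_diag))
```

For HMC v2 the inverse mass matrix is simply *K*, so no additional computation is needed.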

## 4 Results on simulated data

In this section, we first report a study on the efficiency and speed of convergence of different transition operators in sampling from the posterior distribution of individual groups of variables in the SA and AA parameterizations. Secondly, we report the same analysis to compare different parameterizations to obtain samples from the joint posterior distribution of **f** and \(\boldsymbol{\theta}\).

### 4.1 Experimental setup

We simulated data from the four GP models considered in this work, namely: LRG, LCX, VLT, and ORD. We generated 10 data sets simulating from each of the four models for all combinations of *n*=100,400, and *d*=2,10, for a total of 160 distinct data sets. In order to isolate the effect of different likelihood functions in the results, we seeded the generation of the input data matrix *X*, hyper-parameters, and latent variables so that these were the same across different models. Covariates were generated uniformly in the unit hyper-cube, and the parameters used to generate latent variables were *σ*=exp(2), \(\psi_{\tau_{i}} \sim U[-3, -1]\). We imposed Gamma priors on the length-scale parameters with shape *a* and rate *b*, *p*(*τ* _{ i })=Gam(*τ* _{ i }|*a*=1,*b*=1). We imposed an inverse Gamma prior on *σ*, *p*(*σ*)=invGam(*σ*|*a*=1,*b*=1), where *a* and *b* are shape and scale parameters respectively, to exploit conjugacy in the SA parameterization.

In all the experiments we collected 20000 samples after a burn-in phase of 5000 iterations; during the burn-in we also had an adaptive phase to allow the samplers to reach recommended acceptance rates (for example around 25 % for MH). The transition operators for **f** had the following tuning parameters: *α* for MH v1 and MH v2, and *ε* for SMMALA and the variants of HMC, which used a maximum of 10 leapfrog steps. The transition operators for \(\boldsymbol{\theta}\) employed the following proposals: MH used a covariance \(\varSigma = \alpha I\), HMC used a mass matrix \(M = \alpha I\) and a maximum of 10 leapfrog steps, and SMMALA used a step-size *ε*. Convergence analysis was performed using the \(\hat{R}\) potential scale reduction factor (Gelman and Rubin 1992), which is a classic score used to assess convergence of MCMC algorithms. The computation of the \(\hat{R}\) value is based on the within and between chain variances; a value close to one indicates that convergence is reached. The \(\hat{R}\) value was computed based on 10 chains initialized from the prior to study what efficiency can be achieved without running preliminary simulations; this is different from the initialization procedure suggested in Gelman and Rubin (1992) that requires locating the modes of the target density. Due to the fairly diffuse priors on the length-scale parameters, we noticed difficulty in achieving convergence in some cases; we therefore initialized \(\psi_{\tau_{i}}\) randomly in the interval [−3,−1]. The value of \(\hat{R}\) was checked at 1000, 2000, 5000, 10000, 20000 iterations. We use the following procedure to compactly visualize the speed of convergence: we threshold the median value of \(\hat{R}\) across 10 data sets at each checkpoint and report it with a visual coding in which each rectangle encodes a range of values, so that one symbol indicates that \(\hat{R} < 1.1\), the next indicates that \(1.1 < \hat{R} < 1.3\), and so on. We then stack the rectangles associated to each checkpoint where we computed the value of \(\hat{R}\), thus producing a sort of histogram of the median of \(\hat{R}\) over the iterations. Efficiency of MCMC methods is compared based on the minimum of the Effective Sample Size (ESS) (Robert and Casella 2005) computed across all the sampled variables. We then report its mean and standard deviation across the 10 chains and the 10 different data sets for each combination of size of the data set, dimensionality, and type of likelihood.
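For reference, the basic form of the potential scale reduction factor can be computed from the within and between chain variances as sketched below. This is the simple textbook estimator without the degrees-of-freedom correction of Gelman and Rubin (1992), so it should be read as an approximation of the score used in the experiments; the function name is illustrative.

```python
import numpy as np

def potential_scale_reduction(chains):
    """Basic R-hat for one scalar quantity; chains has shape (m, T)."""
    chains = np.asarray(chains, dtype=float)
    m, T = chains.shape
    B = T * chains.mean(axis=1).var(ddof=1)        # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()          # within-chain variance
    var_plus = (T - 1) / T * W + B / T             # pooled variance estimate
    return np.sqrt(var_plus / W)
```

With several sampled variables, the function would be applied to each variable in turn at every checkpoint.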

We are also interested in statistically assessing which methods achieve faster convergence. In order to do so, we perform pairwise Mann-Whitney tests with significance level of 0.05 comparing the value of \(\hat{R}\) at the last checkpoint for all the chains across 10 data sets. This allows us to obtain an ordering of methods in terms of convergence speed. In each table we include a row at the bottom reporting the result of such a test. We denote by 1|2 situations where the method in row 1 of the corresponding table converges significantly faster than the method in row 2. Instead, the notation 1,2 is used when the method in row 1 does not converge significantly faster than the method in row 2.

As a measure of complexity, we counted the number of operations with complexity in *O*(*n* ^{3}), namely number of Cholesky factorizations of *n*×*n* matrices (#C), number of inversions of *n*×*n* matrices (#I),^{2} and number of multiplications of *n*×*n* matrices (#M). We believe that this is a more reliable measure of complexity than running time, as running time can be affected by several implementation details and other factors that are not directly related to the actual complexity of the algorithms.

### 4.2 Assessing the efficiency of samplers for individual groups of variables

Table 1 reports the number of operations in \(O(n^{3})\) required by each transition operator, where *T* represents the number of iterations, *d* the number of covariates and \(\bar{\lambda}\) the average number of leapfrog steps in HMC transition operators. In the following sub-sections we present results about the sampling of the latent variables and hyper-parameters separately.

Table 1: Breakdown of the number of operations in \(O(n^{3})\) required to apply the transition operators considered in this work. #M, #I and #C represent the number of multiplications of *n*×*n* matrices, inversions of *n*×*n* matrices, and Cholesky decompositions respectively. Counts are reported as functions of the number of iterations *T* and number of covariates *d*. In HMC, \(\bar{\lambda}\) denotes the average number of leapfrog steps in one iteration

| | \(\mathbf{f} \mid \mathbf{y}, \boldsymbol{\theta}\) | | | \(\boldsymbol{\theta} \mid \mathbf{f}\) | | | \(\boldsymbol{\theta} \mid \mathbf{y}, \boldsymbol{\nu}\) | | |
|---|---|---|---|---|---|---|---|---|---|
| | #M | #I | #C | #M | #I | #C | #M | #I | #C |
| MH | 0 | 0 | 1 | 0 | 0 | | 0 | 0 | |
| HMC | 0 | 0 | 1 | 0 | \(T \bar{\lambda}\) | | 0 | 0 | \(T + Td \bar{\lambda}\) |
| SMMALA | 0 | 1 | | | | | 0 | 0 | |
| ELL-SS | 0 | 0 | 1 | − | − | − | − | − | − |
| MH v1 | 0 | 0 | 1 | − | − | − | − | − | − |
| MH v2 | 0 | 0 | 1 | − | − | − | − | − | − |
| HMC v1 | 0 | 1 | 2 | − | − | − | − | − | − |
| HMC v2 | 0 | 0 | 1 | − | − | − | − | − | − |

#### 4.2.1 Sampling **f**|**y**,**θ**

This section compares the transition operators for sampling **f**. The results can be found in Table 2, and they were obtained by fixing \(\boldsymbol{\theta}\) to the values used to generate the data. We notice that different likelihood functions heavily affect efficiency and convergence speed; in the examples considered here, the results show that in LRG it is possible to achieve efficiency one order of magnitude higher than in other models. The scaled versions of MH work well in the case of LRG (MH v1 is slightly better than MH v2), but do not offer guarantees of convergence on other models. ELL-SS achieves better efficiency and convergence than the scaled versions of MH. SMMALA, which uses gradient and curvature information, achieves good efficiency and faster convergence than MH v1, MH v2, and ELL-SS, but at the cost of one operation in \(O(n^{3})\) at each iteration, as the metric tensor is a function of **f** and needs to be factorized at each iteration. Overall, the results suggest that the scaled versions of HMC are the best sampling methods for \(\mathbf{f} | \boldsymbol{\theta}, \mathbf{y}\). HMC v1 is slightly better than HMC v2, but it requires one extra inversion and one extra Cholesky decomposition compared to HMC v2, which does not require any operations in \(O(n^{3})\) once the covariance matrix of the GP is factorized.

Table 2: Comparison of transition operators to sample \(\mathbf{f} | \mathbf{y}, \boldsymbol{\theta}\) for data generated from models with four different likelihoods. Minimum ESS is averaged over 10 chains for 10 different data sets for each value of *n* and *d*. The last row in each sub-table reports the result of the statistical test to assess which operators achieve significantly faster convergence

#### 4.2.2 SA parameterization—sampling **θ**|**f**

**θ**

In this section we present results about the sampling of hyper-parameters from the posterior distribution **θ**|**f**,**y** which, given the hierarchical structure of the model, is simply **θ**|**f**, independent from the data model. As reported in Table 1, the complexity of applying SMMALA and HMC is quite high compared to MH. MH requires one Cholesky factorization of *Q* at each iteration. In HMC, at each leapfrog step, the gradients of *Q* with respect to **θ** are needed and the cheapest way to do this is by inverting *Q* first and noticing that all the remaining operations are in *O*(*n*^{2}); this is done \(\bar{\lambda}\) times on average at every iteration of HMC. Similarly, in SMMALA the gradient can be computed by inverting *Q* first; by doing so, the metric tensor can then be computed by *d* multiplications with the derivatives of *Q* and no other *O*(*n*^{3}) operations.
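To make the cost argument concrete, the gradient of the log-density of **f** given **θ** can be written out explicitly. The expression below is the standard identity for a zero-mean Gaussian with covariance *Q*; the symbols \(\theta_i\) for the *i*-th hyper-parameter and \(\mathbf{a}\) for the solved vector are notation introduced here for illustration.

\[
\frac{\partial \log p(\mathbf{f}\mid\boldsymbol{\theta})}{\partial \theta_i}
= \frac{1}{2}\,\mathbf{a}^{\top}\frac{\partial Q}{\partial \theta_i}\,\mathbf{a}
- \frac{1}{2}\,\mathrm{Tr}\!\left(Q^{-1}\frac{\partial Q}{\partial \theta_i}\right),
\qquad \mathbf{a} = Q^{-1}\mathbf{f}.
\]

Once \(Q^{-1}\) is available, \(\mathbf{a}\) costs one matrix-vector product, the quadratic form costs another, and the trace reduces to the sum of the element-wise products of \(Q^{-1}\) and \(\partial Q/\partial\theta_i\), so each of the *d* components is in *O*(*n*^{2}). The gradient of the log-posterior of **θ**|**f** adds the term \(\partial \log p(\boldsymbol{\theta})/\partial\theta_i\) from the prior over the hyper-parameters.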

The results were obtained by fixing **f** to the value used to generate the data and sampling only the length-scale parameters, as *σ* can be efficiently sampled using exact Gibbs steps. HMC improves quite substantially on efficiency, but not on speed of convergence; it may be worth employing some rescaling of the hyper-parameters to improve on this, as suggested by Neal (1996). The performance of SMMALA is highly variable in efficiency and it converges more slowly than MH and HMC. This might be due to the skewness of the target distribution, which is known to affect the efficiency of SMMALA (Stathopoulos and Filippone 2011). The results indicate that MH strikes a good balance between efficiency and computational cost.

Comparison of transition operators to sample **θ**|**f**. Minimum ESS is averaged over 10 chains for 10 different data sets for each value of *n* and *d*. The *last row* reports the result of the statistical test to assess which operators achieve significantly faster convergence

#### 4.2.3 AA parameterization—sampling **θ**|**y**,**ν**

In this section we present the sampling of the hyper-parameters from the posterior distribution **θ**|**y**,**ν**, where we fixed **ν** to the values used to generate the data. The analysis of complexity shows that MH requires one Cholesky factorization at each iteration. In HMC, each leapfrog requires computing *L* and the gradient of *L* with respect to **θ** and no other operations in *O*(*n*^{3}); this can be computed using the differentiation of the Cholesky algorithm which requires *d* operations in *O*(*n*^{3}) (Smith 1995). Likewise, for SMMALA *L* and the *d* derivatives of *L* with respect to **θ** are the only operations in *O*(*n*^{3}) needed.
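The role of the derivatives of *L* can be seen by writing the target explicitly. The expressions below assume the usual whitening convention for this parameterization, namely \(Q = LL^{\top}\) and \(\mathbf{f} = L\boldsymbol{\nu}\) with \(\boldsymbol{\nu}\) a priori standard normal; this convention is stated here as an assumption rather than quoted from the text above.

\[
\log p(\boldsymbol{\theta}\mid\mathbf{y},\boldsymbol{\nu})
= \log p(\mathbf{y}\mid\mathbf{f} = L\boldsymbol{\nu}) + \log p(\boldsymbol{\theta}) + \text{const},
\]

\[
\frac{\partial}{\partial\theta_i}\log p(\mathbf{y}\mid L\boldsymbol{\nu})
= \left[\nabla_{\mathbf{f}}\log p(\mathbf{y}\mid\mathbf{f})\right]^{\top}
\frac{\partial L}{\partial\theta_i}\,\boldsymbol{\nu}.
\]

Under this assumption the prior on \(\boldsymbol{\nu}\) does not depend on **θ**, so the hyper-parameters enter only through *L* inside the likelihood. Evaluating the target needs *L* itself, while each gradient component needs \(\partial L/\partial\theta_i\), which is where the *d* operations in *O*(*n*^{3}) arise.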

Comparison of transition operators to sample **θ**|**y**,**ν** for data generated from models with four different likelihoods. Minimum ESS is computed as the average over 10 chains for 10 different data sets for each value of *n* and *d*. The *last row in each sub-table* reports the result of the statistical test to assess which operators achieve significantly faster convergence

### 4.3 Assessing the efficiency of different parameterizations

**f** using HMC v2 and **θ** using MH; HMC v2 and MH were adapted during the burn-in phase and in HMC v2 we set the maximum number of leapfrog steps to 10. For the sake of brevity, we focus on the LRG model only; the results on efficiency and speed of convergence in sampling hyper-parameters are reported in Table 5.

Comparison of different strategies to sample **f**,**θ**|**y** for data generated from a LRG model. The *rightmost column* reports the complexity of the different methods with respect to the number of inversions and Cholesky decompositions. In KHR, \(\bar{\kappa}\) represents the average number of iterations to run the Laplace Approximation

It is striking to see how challenging it is to efficiently sample from the posterior distribution of latent variables and hyper-parameters. Sampling efficiency is generally low; this is consistent with our experience in other applications involving sampling in hierarchical models (Filippone et al. 2012a). As expected, the SA parameterization is the worst among the ones we tested. The AA parameterization, ASIS, and SURR generally offer good guarantees of convergence within a few thousand iterations. SURR seems to be superior in efficiency, which is consistent with what is reported in Murray and Adams (2010), but it requires more operations in *O*(*n*^{3}) compared to AA and ASIS. ASIS slightly improves efficiency and speed of convergence with respect to the AA scheme but requires double the number of operations in *O*(*n*^{3}). KHR seems effective in breaking the correlation between the two groups of variables, but it may require several iterations within the approximation used to sample **f**. In the experiments considered here \(\bar{\kappa}\) is around 8, so the best compromise between computations and efficiency seems to be given by the AA and ASIS parameterizations.
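The extra cost of ASIS relative to AA is easier to see on a toy example. The sketch below interweaves an SA update and an AA update of the hyper-parameter within one sweep, following the general interweaving idea of Yu and Meng (2011). Everything specific in it is an assumption made for illustration: the scalar model (latent *f* with prior variance *θ*, so that *L* is the square root of *θ*, and one Gaussian observation with unit noise), the standard normal prior on log *θ*, and the names `mh_step` and `asis_sweep`. It is not the GP implementation evaluated above.

```python
import math
import random


def log_prior(log_theta):
    # Assumed prior for illustration: log(theta) ~ N(0, 1).
    return -0.5 * log_theta ** 2


def log_norm(x, mean, var):
    """Log-density of a univariate Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var


def mh_step(log_theta, log_target, step, rng):
    """Random-walk Metropolis update on log(theta)."""
    proposal = log_theta + rng.gauss(0.0, step)
    log_ratio = log_target(proposal) - log_target(log_theta)
    if math.log(1.0 - rng.random()) < log_ratio:
        return proposal
    return log_theta


def asis_sweep(f, log_theta, y, step, rng):
    """One interweaved sweep for the toy model f ~ N(0, theta), y ~ N(f, 1)."""
    # 1. Latent variable given theta and y (exact and conjugate in this toy).
    theta = math.exp(log_theta)
    post_var = 1.0 / (1.0 / theta + 1.0)
    f = rng.gauss(post_var * y, math.sqrt(post_var))
    # 2. SA update: theta | f, which does not involve the data.
    log_theta = mh_step(
        log_theta,
        lambda lt: log_norm(f, 0.0, math.exp(lt)) + log_prior(lt),
        step, rng)
    # 3. Switch to the whitened variable nu = f / L, with L = sqrt(theta).
    nu = f / math.exp(0.5 * log_theta)
    # 4. AA update: theta | y, nu, with f = L(theta) * nu inside the likelihood.
    log_theta = mh_step(
        log_theta,
        lambda lt: log_norm(y, math.exp(0.5 * lt) * nu, 1.0) + log_prior(lt),
        step, rng)
    # 5. Map back to f using the updated theta.
    f = math.exp(0.5 * log_theta) * nu
    return f, log_theta


# Toy usage with a single observation y = 1.5.
rng = random.Random(0)
f, log_theta, trace = 0.0, 0.0, []
for _ in range(1000):
    f, log_theta = asis_sweep(f, log_theta, 1.5, 0.5, rng)
    trace.append(log_theta)
print("mean of log(theta):", sum(trace) / len(trace))
```

Steps 2 and 4 each evaluate a target that depends on the factorization of the covariance at the proposed hyper-parameter value. In the scalar toy this is a square root, but with a full covariance matrix each evaluation is a Cholesky factorization, which is consistent with the doubling of *O*(*n*^{3}) operations noted above.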

## 5 Results on real data

Comparison of different strategies to sample **f**,**θ**|**y** in four UCI data sets modeled using a LRG model

## 6 Conclusions

In this paper we studied and compared a number of state-of-the-art strategies to carry out the fully Bayesian treatment of GP models. We focused on four GP models and performed an extensive evaluation of efficiency, convergence speed, and computational complexity of several transition operators and sampling strategies.

The results in this paper show that latent variables can be sampled quite efficiently with little computational effort once the GP covariance matrix is factorized. This can be achieved by a simple variant of HMC that we introduced in this paper. Regarding the sampling of hyper-parameters in different parameterizations, the results presented here indicate that the gain in sampling efficiency given by the use of complicated proposal mechanisms does not scale as much as their computational cost. It would be interesting to investigate some recently proposed variants to slice sampling (Thompson and Neal 2010) and Hybrid Monte Carlo (Hoffman and Gelman 2012) on the sampling of hyper-parameters.

The analysis of the results obtained with different parameterizations suggests that AA is a sensible and computationally cheap parameterization with good convergence properties. AA performs similarly to ASIS at half the computational cost. It makes sense, however, to employ ASIS when in doubt about the best parameterization to use, although GP models with full covariance matrices will generally fall into the weak data limit as the *O*(*n*^{2}) space and *O*(*n*^{3}) time complexities constrain the number of data that can be processed.

In general, the results show how challenging it is to efficiently sample from the posterior distribution of latent variables and hyper-parameters in GP models and motivate further research into methods to do this efficiently. Some sampling strategies, such as the one based on the AA parameterization, are capable of achieving convergence within a reasonable number of iterations, and this makes it possible to carry out the fully Bayesian treatment of GP models dealing with a small to moderate number of samples. We have recently demonstrated that this is indeed the case in Filippone et al. (2012a), but more needs to be done in the direction of developing robust stochastic-based inference methods for GP models.

It would be interesting to investigate how performance is affected by the choice of the design, which in the simulated data presented here was assumed uniform. Also, we studied in particular GP models with the squared exponential ARD covariance function. It would be interesting to compare the methods considered here in models characterized by other covariance functions, such as the Matérn, or sparse inverse covariance functions as in Rue et al. (2009); the latter would make it possible to test the strong data limit case. Finally, in this study we have not included a mean function for the GP prior or extra parameters for the likelihood function. This would require including the sampling of other quantities that may further impact efficiency and speed of convergence.

## Footnotes

1. An implementation of the methods considered in this paper can be found at http://www.dcs.gla.ac.uk/~maurizio/pages/code.html.
2. This is a shorthand notation to denote a back and forward substitution of the identity matrix using Cholesky factors.

## References

- Amari, S., & Nagaoka, H. (2000). *Translations of mathematical monographs: Vol. 191. Methods of information geometry*. Oxford: Oxford University Press.
- Asuncion, A., & Newman, D. J. (2007). UCI Machine Learning Repository.
- Christensen, O. F., Roberts, G. O., & Rosenthal, J. S. (2005). Scaling limits for the transient phase of local Metropolis–Hastings algorithms. *Journal of the Royal Statistical Society. Series B. Statistical Methodology*, *67*(2), 253–268.
- Chu, W., & Ghahramani, Z. (2005). Gaussian processes for ordinal regression. *Journal of Machine Learning Research*, *6*, 1019–1041.
- Cseke, B., & Heskes, T. (2011). Approximate marginals in latent Gaussian models. *Journal of Machine Learning Research*, *12*, 417–454.
- Filippone, M., Marquand, A. F., Blain, C. R. V., Williams, S. C. R., Mourão-Miranda, J., & Girolami, M. (2012a). Probabilistic prediction of neurological disorders with a statistical assessment of neuroimaging data modalities. *Annals of Applied Statistics*, *6*(4), 1883–1905.
- Filippone, M., Zhong, M., & Girolami, M. (2012b). *On the fully Bayesian treatment of latent Gaussian models using stochastic simulations* (Technical Report TR-2012-329). School of Computing Science, University of Glasgow.
- Flegal, J. M., Haran, M., & Jones, G. L. (2007). Markov chain Monte Carlo: can we trust the third significant figure? *Statistical Science*, *23*(2), 250–260.
- Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. *Statistical Science*, *7*(4), 457–472.
- Girolami, M., & Calderhead, B. (2011). Riemann manifold Langevin and Hamiltonian Monte Carlo methods. *Journal of the Royal Statistical Society. Series B. Statistical Methodology*, *73*(2), 123–214.
- Hoffman, M. D., & Gelman, A. (2011). *The No-U-Turn Sampler: adaptively setting path lengths in Hamiltonian Monte Carlo*. arXiv:1111.4246. *Journal of Machine Learning Research*, to appear.
- Knorr-Held, L., & Rue, H. (2002). On block updating in Markov random field models for disease mapping. *Scandinavian Journal of Statistics*, *29*(4), 597–614.
- Kuss, M., & Rasmussen, C. E. (2005). Assessing approximate inference for binary Gaussian process classification. *Journal of Machine Learning Research*, *6*, 1679–1704.
- Mackay, D. J. C. (1994). Bayesian methods for backpropagation networks. In E. Domany, J. L. van Hemmen, & K. Schulten (Eds.), *Models of neural networks III* (pp. 211–254). Berlin: Springer. Chap. 6.
- Minka, T. P. (2001). Expectation propagation for approximate Bayesian inference. In *Proceedings of the 17th conference in uncertainty in artificial intelligence (UAI ’01)*, San Francisco, CA, USA (pp. 362–369). San Mateo: Morgan Kaufmann.
- Møller, J., Syversveen, A. R., & Waagepetersen, R. P. (1998). Log Gaussian Cox processes. *Scandinavian Journal of Statistics*, *25*(3), 451–482.
- Murray, I., & Adams, R. P. (2010). Slice sampling covariance hyperparameters of latent Gaussian models. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, & A. Culotta (Eds.), *NIPS* (pp. 1732–1740). Red Hook: Curran Associates.
- Murray, I., Adams, R. P., & MacKay, D. J. C. (2010). Elliptical slice sampling. *Journal of Machine Learning Research*, *9*, 541–548.
- Neal, R. (2003). Slice sampling. *The Annals of Statistics*, *31*, 705–767.
- Neal, R. M. (1993). *Probabilistic inference using Markov chain Monte Carlo methods* (Technical Report CRG-TR-93-1). Dept. of Computer Science, University of Toronto.
- Neal, R. M. (1996). *Lecture notes in statistics: Bayesian learning for neural networks*. Berlin: Springer.
- Neal, R. M. (1999). Regression and classification using Gaussian process priors (with discussion). *Bayesian Statistics*, *6*, 475–501.
- Opper, M., & Winther, O. (2000). Gaussian processes for classification: mean-field algorithms. *Neural Computation*, *12*(11), 2655–2684.
- Rasmussen, C. E., & Williams, C. (2006). *Gaussian processes for machine learning*. Cambridge: MIT Press.
- Robert, C. P., & Casella, G. (2005). *Monte Carlo statistical methods (Springer texts in statistics)*. New York: Springer.
- Roberts, G. O., & Stramer, O. (2002). Langevin diffusions and Metropolis-Hastings algorithms. *Methodology and Computing in Applied Probability*, *4*(4), 337–357.
- Rue, H., Martino, S., & Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. *Journal of the Royal Statistical Society. Series B. Statistical Methodology*, *71*(2), 319–392.
- Smith, S. P. (1995). Differentiation of the Cholesky algorithm. *Journal of Computational and Graphical Statistics*, *4*(2), 134–147.
- Stathopoulos, V., & Filippone, M. (2011). Discussion of the paper “Riemann manifold Langevin and Hamiltonian Monte Carlo methods” by Mark Girolami and Ben Calderhead. *Journal of the Royal Statistical Society. Series B. Statistical Methodology*, *73*(2), 167–168.
- Thompson, M., & Neal, R. M. (2010). *Covariance-adaptive slice sampling* (Technical Report 1002). Department of Statistics, University of Toronto.
- Tierney, L., & Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities. *Journal of the American Statistical Association*, *81*(393), 82–86.
- Vanhatalo, J., & Vehtari, A. (2007). Sparse log Gaussian processes via MCMC for spatial epidemiology. *Journal of Machine Learning Research*, *1*, 73–89.
- Wilson, A. G., & Ghahramani, Z. (2010). Copula processes. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, & A. Culotta (Eds.), *NIPS* (pp. 2460–2468). Red Hook: Curran Associates.
- Yu, Y., & Meng, X.-L. (2011). To center or not to center: that is not the question—an ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC efficiency. *Journal of Computational and Graphical Statistics*, *20*(3), 531–570.