# Hilbert space methods for reduced-rank Gaussian process regression

## Abstract

This paper proposes a novel scheme for reduced-rank Gaussian process regression. The method is based on an approximate series expansion of the covariance function in terms of an eigenfunction expansion of the Laplace operator in a compact subset of \(\mathbb {R}^d\). On this approximate eigenbasis, the eigenvalues of the covariance function can be expressed as simple functions of the spectral density of the Gaussian process, which allows the GP inference to be solved under a computational cost scaling as \(\mathcal {O}(nm^2)\) (initial) and \(\mathcal {O}(m^3)\) (hyperparameter learning) with *m* basis functions and *n* data points. Furthermore, the basis functions are independent of the parameters of the covariance function, which allows for very fast hyperparameter learning. The approach also allows for rigorous error analysis with Hilbert space theory, and we show that the approximation becomes exact when the size of the compact subset and the number of eigenfunctions go to infinity. We also show that the convergence rate of the truncation error is independent of the input dimensionality provided that the differentiability order of the covariance function increases appropriately, and for the squared exponential covariance function it is always bounded by \({\sim }1/m\) regardless of the input dimensionality. The expansion generalizes to Hilbert spaces with an inner product which is defined as an integral over a specified input density. The method is compared to previously proposed methods theoretically and through empirical tests with simulated and real data.

## Keywords

Gaussian process regression · Laplace operator · Eigenfunction expansion · Pseudo-differential operator · Reduced-rank approximation

## 1 Introduction

*n* grows large. The computational requirements arise because in solving the GP regression problem we need to invert the \(n \times n\) Gram matrix \(\mathbf {K} + \sigma _{n}^2\mathbf {I}\), where \(\mathbf {K}_{ij} = k(\mathbf {x}_i,\mathbf {x}_j)\), which is an \(\mathcal {O}(n^3)\) operation in general.

To overcome this problem, over the years, several schemes have been proposed. They typically reduce the storage requirements to \(\mathcal {O}(nm)\) and complexity to \(\mathcal {O}(nm^2)\), where \(m < n\). Some early methods have been reviewed in Rasmussen and Williams (2006), and Quiñonero-Candela and Rasmussen (2005b) provide a unifying view on several methods. From a spectral point of view, several of these methods (e.g., SOR, DTC, VAR, FIC) can be interpreted as modifications to the so-called *Nyström method* (see Baker 1977; Williams and Seeger 2001), a scheme for approximating the eigenspectrum.

For stationary covariance functions, the spectral density of the covariance function can be employed: In this context, the spectral approach has mainly been considered in regular grids, as this allows for the use of FFT-based methods for fast solutions (see Paciorek 2007; Fritz et al. 2009) and more recently in terms of converting GPs to state space models (Särkkä and Hartikainen 2012; Särkkä et al. 2013). Recently, Lázaro-Gredilla et al. (2010) proposed a sparse spectrum method where a randomly chosen set of spectral points span a trigonometric basis for the problem.

The methods proposed in this article fall into the class of methods called reduced-rank approximations (see, e.g., Rasmussen and Williams 2006, Ch. 8) which are based on approximating the Gram matrix \(\mathbf {K}\) with a matrix \(\tilde{\mathbf {K}}\) with a smaller rank \(m < n\). This allows for the use of the matrix inversion lemma (Woodbury formula) to speed up the computations. It is well known that the optimal reduced-rank approximation of the Gram (covariance) matrix \(\mathbf {K}\) with respect to the Frobenius norm is \(\tilde{\mathbf {K}} = \varvec{\Phi }\varvec{\Lambda }\varvec{\Phi }^\mathsf {T}\), where \(\varvec{\Lambda }\) is a diagonal matrix of the leading *m* eigenvalues of \(\mathbf {K}\) and \(\varvec{\Phi }\) is the matrix of the corresponding orthonormal eigenvectors (Golub and Van Loan 1996; Rasmussen and Williams 2006, Ch. 8). Yet, as computing the eigendecomposition is an \(\mathcal {O}(n^3)\) operation, this provides no remedy as such.

In this work, we propose a novel method for obtaining approximate eigendecompositions of covariance functions in terms of an eigenfunction expansion of the Laplace operator in a compact subset of \(\mathbb {R}^d\). The method is based on interpreting the covariance function as the kernel of a pseudo-differential operator (Shubin 1987) and approximating it using Hilbert space methods (Courant and Hilbert 2008; Showalter 2010). This results in a reduced-rank approximation for the covariance function, where the basis functions are independent of the covariance function and its parameters. We also show that the approximation converges to the exact solution in well-defined conditions, analyze its convergence rate and provide theoretical and experimental comparisons to existing state-of-the-art methods. This path has not been explored in the GP regression context before, although the approach is related to the Fourier feature methods (Hensman et al. 2018) and stochastic partial differential equation-based methods recently introduced to spatial statistics and GP regression (Lindgren et al. 2011; Särkkä and Hartikainen 2012; Särkkä et al. 2013) as well as to classical works in the spectral representations of stochastic processes (Loève 1963; Van Trees 1968; Adler 1981; Cramér and Leadbetter 2013) and spline interpolation (Wahba 1978, 1990; Kimeldorf and Wahba 1970). Recently, the scalable eigendecomposition approach has also been tackled by various structure exploiting methods (building on the work by Wilson and Nickisch 2015) and extended to methods exploiting GPU computations.

This paper is structured as follows: In Sect. 2, we derive the approximative series expansion of the covariance functions. Section 3 is dedicated to applying the approximation scheme to GP regression and providing details of the computational benefits. We provide a detailed analysis of the convergence of the method in Sect. 4. Sections 5 and 6 provide comparisons to existing methods, the former from a more theoretical point of view, whereas the latter contains examples and comparative evaluation on several datasets. Finally, the properties of the method are summarized and discussed in Sect. 7.

## 2 Approximating the covariance function

In this section, we start by stating the assumptions and properties of the class of covariance functions that we are considering and show how a homogeneous covariance function can be considered as a pseudo-differential operator constructed as a series of Laplace operators. Then we show how the pseudo-differential operators can be approximated with Hilbert space methods on compact subsets of \(\mathbb {R}^d\) or via inner products with integrable weight functions and discuss connections to Sturm–Liouville theory.

### 2.1 Spectral densities of homogeneous and isotropic Gaussian processes

*Bochner’s theorem* (see, e.g., Akhiezer and Glazman 1993; Da Prato and Zabczyk 1992) which states that a bounded continuous positive definite function \(k(\mathbf {r})\) can be represented as

*spectral density* \(S({\omega })\) corresponding to the covariance function \(k(\mathbf {r})\). This gives rise to the Fourier duality of covariance and spectral density, which is known as the

*Wiener–Khintchin theorem* (Rasmussen and Williams 2006, Ch. 4), giving the identities

*isotropic*, that is, it only depends on the Euclidean norm \(||\mathbf {r}||\) such that \(k(\mathbf {r}) \triangleq k(||\mathbf {r}||)\), then the spectral density will also only depend on the norm of \({\omega }\) such that we can write \(S({\omega }) \triangleq S(||{\omega }||)\). In the following, we assume that the considered covariance functions are indeed isotropic, but the approach can be generalized to more general homogeneous covariance functions.

### 2.2 The covariance operator as a pseudo-differential operator

*f* we have

### 2.3 Hilbert space approximation of the covariance operator

*f* in the domain \(\varOmega \) assuming Dirichlet boundary conditions.

*f* and in the current domain with the assumed boundary conditions.

*j*th eigenvalue and \(\phi _j(\cdot )\) the eigenfunction of the Laplace operator in a given domain. These expressions tend to be simple closed-form expressions.

The right-hand side of (20) is very easy to evaluate, because it corresponds to evaluating the spectral density at the square roots of the eigenvalues and multiplying them with the eigenfunctions of the Laplace operator. Because the eigenvalues of the Laplace operator are monotonically increasing with *j* and for bounded covariance functions the spectral density goes to zero fast with higher frequencies, we can expect to obtain a good approximation of the right-hand side by retaining only a finite number of terms in the series. However, even with an infinite number of terms this is only an approximation, because we assumed a compact domain with boundary conditions. The approximation can nevertheless be expected to be good at input values which are not near the boundary of \(\varOmega \), where the Laplacian was taken to be zero.

As an example, Fig. 1 shows Matérn covariance functions of various degrees of smoothness \(\nu \) (see, e.g., Rasmussen and Williams 2006, Ch. 4) and approximations for different numbers of basis functions in the approximation. The basis consists of the eigenfunctions of the Laplacian in (10) with \(\varOmega = [-\,L,L]\) which gives the eigenfunctions \(\phi _j(x) = L^{-1/2} \sin (\pi j (x + L)/ (2L))\) and the eigenvalues \(\lambda _j = (\pi \, j / (2L))^2\). In the figure, we have set \(L = 1\) and \(\ell = 0.1\). For the squared exponential, the approximation is indistinguishable from the exact curve already at \(m=12\), whereas the less smooth functions require more terms.
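The truncated series can be evaluated in a few lines. The following is a minimal sketch, assuming the one-dimensional squared exponential covariance, whose spectral density is \(S(\omega ) = \sigma ^2 \sqrt{2\pi }\, \ell \exp (-\ell ^2 \omega ^2 / 2)\) under the \((2\pi )^{-1}\) Fourier convention. All function names are illustrative, and the parameter values in the usage note are chosen for the example; they are not the ones used for Fig. 1.

```python
import math


def se_spectral_density(omega, sigma2=1.0, ell=1.0):
    """Spectral density of the 1-D squared exponential covariance (assumed form)."""
    return sigma2 * math.sqrt(2.0 * math.pi) * ell * math.exp(-0.5 * (ell * omega) ** 2)


def laplace_eigenvalue(j, L):
    """j-th Dirichlet eigenvalue of the negative Laplacian on [-L, L]."""
    return (math.pi * j / (2.0 * L)) ** 2


def laplace_eigenfunction(j, x, L):
    """j-th Dirichlet eigenfunction of the Laplacian on [-L, L]."""
    return math.sin(math.pi * j * (x + L) / (2.0 * L)) / math.sqrt(L)


def approx_kernel(x, x2, m, L, sigma2=1.0, ell=1.0):
    """m-term approximation: sum_j S(sqrt(lambda_j)) phi_j(x) phi_j(x2)."""
    total = 0.0
    for j in range(1, m + 1):
        s = se_spectral_density(math.sqrt(laplace_eigenvalue(j, L)), sigma2, ell)
        total += s * laplace_eigenfunction(j, x, L) * laplace_eigenfunction(j, x2, L)
    return total


def exact_kernel(x, x2, sigma2=1.0, ell=1.0):
    """Exact squared exponential covariance, for comparison."""
    return sigma2 * math.exp(-0.5 * ((x - x2) / ell) ** 2)
```

For instance, with the illustrative values `L=3.0`, `ell=0.5` and `m=40`, `approx_kernel(0.0, 0.3, 40, 3.0, ell=0.5)` agrees closely with `exact_kernel(0.0, 0.3, ell=0.5)`, while `approx_kernel(3.0, 3.0, 40, 3.0, ell=0.5)` is essentially zero, which illustrates the boundary effect discussed above.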

### 2.4 Inner product point of view

*Karhunen–Loève expansion* for a sample function \(f(\mathbf {x})\) with zero mean and the above covariance function:

### 2.5 Connection to Sturm–Liouville theory

*x* is scalar, the eigenvalue problem in Eq. (23) can be written in Sturm–Liouville form as follows:

*w*(*r*)). The Laplacian in spherical coordinates is

## 3 Application of the method to GP regression

In this section, we show how the approximation (20) can be used in Gaussian process regression. We also write down the expressions needed for hyperparameter learning and discuss the computational requirements of the methods.

### 3.1 Gaussian process regression

*f* are assumed to be realizations of a Gaussian random process prior and the observations corrupted by Gaussian noise:

*n*-dimensional vector with the *i*th entry being \(k(\mathbf {x}_*,\mathbf {x}_i)\), and \(\mathbf {y}\) is a vector of the *n* observations.

*m* basis functions of the Laplacian as given in Eq. (20) such that

*m* approximate eigenvalues such that \(\varvec{\Lambda }_{jj} = S(\sqrt{\lambda _j}), j=1,2,\ldots ,m\). Here \(S(\cdot )\) is the spectral density of the Gaussian process and \(\lambda _j\) the *j*th eigenvalue of the Laplace operator. The corresponding eigenvectors in the decomposition are given by the eigenvectors \(\phi _j(\mathbf {x})\) of the Laplacian such that \(\varvec{\Phi }_{ij} = \phi _j(\mathbf {x}_i)\).

*m*-dimensional vector with the *j*th entry being \(\phi _j(\mathbf {x}_*)\). Thus, when the size of the training set is higher than the number of required basis functions \(n > m\), the use of this approximation is advantageous.
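The prediction step can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: it takes \(\tilde{\mathbf {K}} = \varvec{\Phi }\varvec{\Lambda }\varvec{\Phi }^\mathsf {T}\) and applies the matrix inversion lemma, which gives the predictive mean \(\varvec{\phi }_*^\mathsf {T}(\varvec{\Phi }^\mathsf {T}\varvec{\Phi } + \sigma _n^2 \varvec{\Lambda }^{-1})^{-1}\varvec{\Phi }^\mathsf {T}\mathbf {y}\) and the latent variance \(\sigma _n^2\, \varvec{\phi }_*^\mathsf {T}(\varvec{\Phi }^\mathsf {T}\varvec{\Phi } + \sigma _n^2 \varvec{\Lambda }^{-1})^{-1}\varvec{\phi }_*\). The one-dimensional Dirichlet basis on \([-L,L]\), the squared exponential spectral density, and all function names are assumptions made for the example.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def phi(j, x, L):
    """j-th Dirichlet eigenfunction of the Laplacian on [-L, L]."""
    return math.sin(math.pi * j * (x + L) / (2.0 * L)) / math.sqrt(L)


def spectral_density(omega, sigma2, ell):
    """Squared exponential spectral density in one dimension (assumed kernel)."""
    return sigma2 * math.sqrt(2.0 * math.pi) * ell * math.exp(-0.5 * (ell * omega) ** 2)


def reduced_rank_predict(xs, ys, x_star, m, L, sigma2, ell, noise2):
    """Predictive mean and latent variance at x_star using m basis functions."""
    n = len(xs)
    Phi = [[phi(j, x, L) for j in range(1, m + 1)] for x in xs]
    # Lambda_jj = S(sqrt(lambda_j)) with sqrt(lambda_j) = pi j / (2 L)
    lam = [spectral_density(math.pi * j / (2.0 * L), sigma2, ell) for j in range(1, m + 1)]
    # Z = Phi^T Phi + noise2 * Lambda^{-1}, an m x m matrix
    Z = [[sum(Phi[i][a] * Phi[i][b] for i in range(n)) + (noise2 / lam[a] if a == b else 0.0)
          for b in range(m)] for a in range(m)]
    Phi_t_y = [sum(Phi[i][a] * ys[i] for i in range(n)) for a in range(m)]
    phi_star = [phi(j, x_star, L) for j in range(1, m + 1)]
    mean = sum(p * w for p, w in zip(phi_star, solve(Z, Phi_t_y)))
    var = noise2 * sum(p * v for p, v in zip(phi_star, solve(Z, phi_star)))
    return mean, var
```

Only an \(m \times m\) system is solved here; the \(n \times n\) matrix \(\tilde{\mathbf {K}} + \sigma _n^2 \mathbf {I}\) is never formed.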

### 3.2 Learning the hyperparameters

Once the marginal likelihood and its derivatives are available, it is also possible to use other methods for parameter inference such as Markov chain Monte Carlo methods (Liu 2001; Brooks et al. 2011) including Hamiltonian Monte Carlo (HMC, Duane et al. 1987; Neal 2011) as well as numerous others.

### 3.3 Discussion on the computational complexity

As can be noted from Eq. (20), the basis functions in the reduced-rank approximation do not depend on the hyperparameters of the covariance function. Thus it is enough to calculate the product \(\varvec{\Phi }^\mathsf {T}\varvec{\Phi }\) only once, which means that the method has an overall asymptotic computational complexity of \(\mathcal {O}(nm^2)\). After this initial cost, evaluating the marginal likelihood and the marginal likelihood gradient is an \(\mathcal {O}(m^3)\) operation, which in practice comes from the Cholesky factorization of \(\mathbf {Z}\) on each step.
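To illustrate the split between the one-off \(\mathcal {O}(nm^2)\) cost and the per-step \(\mathcal {O}(m^3)\) cost, the sketch below evaluates the negative log marginal likelihood from the precomputed quantities \(\varvec{\Phi }^\mathsf {T}\varvec{\Phi }\), \(\varvec{\Phi }^\mathsf {T}\mathbf {y}\) and \(\mathbf {y}^\mathsf {T}\mathbf {y}\). It assumes the Gaussian marginal likelihood of the rank-*m* model \(\tilde{\mathbf {K}} = \varvec{\Phi }\varvec{\Lambda }\varvec{\Phi }^\mathsf {T}\), rewritten with the matrix inversion and determinant lemmas, and takes \(\mathbf {Z} = \sigma _n^2 \varvec{\Lambda }^{-1} + \varvec{\Phi }^\mathsf {T}\varvec{\Phi }\). The function names and the argument layout are hypothetical.

```python
import math


def cholesky(A):
    """Lower-triangular Cholesky factor C of a symmetric positive definite A."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(C[i][k] * C[j][k] for k in range(j))
            C[i][j] = math.sqrt(s) if i == j else s / C[j][j]
    return C


def forward_solve(C, b):
    """Solve C x = b for lower-triangular C."""
    x = []
    for i in range(len(b)):
        x.append((b[i] - sum(C[i][k] * x[k] for k in range(i))) / C[i][i])
    return x


def neg_log_marginal_likelihood(PtP, Pty, yty, n, lam, noise2):
    """Negative log marginal likelihood of the reduced-rank model.

    PtP = Phi^T Phi (m x m), Pty = Phi^T y (length m), yty = y^T y,
    lam[j] = S(sqrt(lambda_j)); only lam and noise2 change during learning.
    """
    m = len(lam)
    Z = [[PtP[a][b] + (noise2 / lam[a] if a == b else 0.0) for b in range(m)]
         for a in range(m)]
    C = cholesky(Z)                      # the O(m^3) step
    v = forward_solve(C, Pty)            # v.v = y^T Phi Z^{-1} Phi^T y
    logdet_Z = 2.0 * sum(math.log(C[i][i]) for i in range(m))
    quad = (yty - sum(t * t for t in v)) / noise2
    logdet = (n - m) * math.log(noise2) + logdet_Z + sum(math.log(s) for s in lam)
    return 0.5 * (logdet + quad + n * math.log(2.0 * math.pi))
```

None of the arguments `PtP`, `Pty` and `yty` depends on the hyperparameters, so they can be computed once and reused for every evaluation.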

If the number of observations *n* is so large that storing the \(n \times m\) matrix \(\varvec{\Phi }\) is not feasible, the computations of \(\varvec{\Phi }^\mathsf {T}\varvec{\Phi }\) can be carried out in blocks. Storing the evaluated eigenfunctions in \(\varvec{\Phi }\) is not necessary, because the \(\phi _j(\mathbf {x})\) are closed-form expressions that can be evaluated when necessary. In practice, it might be preferable to cache the result of \(\varvec{\Phi }^\mathsf {T}\varvec{\Phi }\) (causing a memory requirement scaling as \(\mathcal {O}(m^2)\)), but this is not required.
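The blockwise computation can be sketched as a streaming accumulation in which each row of \(\varvec{\Phi }\) is evaluated on demand and discarded. The sketch assumes the one-dimensional Dirichlet basis on \([-L,L]\); the function names and the block interface (an iterable of input and output lists) are hypothetical.

```python
import math


def phi_row(x, m, L):
    """Row of Phi for input x: the first m Dirichlet eigenfunctions on [-L, L]."""
    return [math.sin(math.pi * j * (x + L) / (2.0 * L)) / math.sqrt(L)
            for j in range(1, m + 1)]


def accumulate_gram(blocks, m, L):
    """Accumulate Phi^T Phi, Phi^T y, y^T y and n over blocks of (xs, ys).

    Memory use is O(m^2) regardless of the number of observations.
    """
    PtP = [[0.0] * m for _ in range(m)]
    Pty = [0.0] * m
    yty = 0.0
    n = 0
    for xs, ys in blocks:
        for x, y in zip(xs, ys):
            row = phi_row(x, m, L)  # evaluated when needed, never stored
            for a in range(m):
                Pty[a] += row[a] * y
                for b in range(m):
                    PtP[a][b] += row[a] * row[b]
            yty += y * y
            n += 1
    return PtP, Pty, yty, n
```

Since `blocks` can be any iterable, the data can be read from disk or generated lazily, for example `accumulate_gram(((xs[i:i + 1000], ys[i:i + 1000]) for i in range(0, len(xs), 1000)), m, L)`.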

The computational complexity of conventional sparse GP approximations typically scales as \(\mathcal {O}(nm^2)\) in time for each step of evaluating the marginal likelihood. The scaling in demand of storage is \(\mathcal {O}(nm)\). This comes from the inevitable cost of re-evaluating all results involving the basis functions on each step and storing the matrices required for doing this. This applies to all the methods that will be discussed in Sect. 5, with the exception of SSGP, where the storage demand can be relaxed by re-evaluating the basis functions on demand.

We can also consider the rather restrictive case, often encountered in certain applications, where the measurements are constrained to a regular grid. This causes the product of the orthonormal eigenfunction matrices \(\varvec{\Phi }^\mathsf {T}\varvec{\Phi }\) to be diagonal, which avoids the calculation of the matrix inverse altogether. This relates to the FFT-based methods for GP regression (Paciorek 2007; Fritz et al. 2009), and the projections to the basis functions can be evaluated by fast Fourier transform in \(\mathcal {O}(n \log n)\) time complexity.
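The diagonal structure can be checked numerically in the simplest setting. The sketch below assumes a one-dimensional uniform grid that spans the whole domain \([-L,L]\) with *N* intervals and uses only the interior grid points (the basis functions vanish at the boundary); under these assumptions the discrete orthogonality of the sines gives \(\varvec{\Phi }^\mathsf {T}\varvec{\Phi } = \frac{N}{2L}\mathbf {I}\) for \(m < N\). The function name is illustrative, and a grid that is not aligned with the domain in this way is not covered by the example.

```python
import math


def gram_on_grid(N, m, L):
    """Phi^T Phi for the interior points of a uniform grid with N intervals on [-L, L]."""
    xs = [-L + 2.0 * L * i / N for i in range(1, N)]
    Phi = [[math.sin(math.pi * j * (x + L) / (2.0 * L)) / math.sqrt(L)
            for j in range(1, m + 1)] for x in xs]
    return [[sum(row[a] * row[b] for row in Phi) for b in range(m)] for a in range(m)]
```

With `N=16`, `m=5` and `L=2.0`, the diagonal entries equal \(N/(2L) = 4\) and the off-diagonal entries vanish up to rounding error.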

### 3.4 Inverse problems and latent force models

*i*th entry of vector \(\mathbf {k}_{*h}\) is \((\mathcal {H}'\, k(\mathbf {x}_*,\cdot ))(\mathbf {x}_i)\), and \(\mathbf {y}\) is the vector of observations. Here \(\mathcal {H}'\) denotes that the operator is applied to the second variable \(\mathbf {x}'\) of the argument. With the series expansion (20), we can easily approximate

## 4 Convergence analysis

In this section, we analyze the convergence of the proposed approximation when the size of the domain \(\varOmega \) and the number of terms in the series grow to infinity. We start by analyzing a univariate problem in the domain \(\varOmega = [-L,L]\) and with Dirichlet boundary conditions and then generalize the result to *d*-dimensional cubes \(\varOmega = [-L_1,L_1] \times \cdots \times [-L_d,L_d]\). Then we analyze the truncation error as a function of the number of terms in the series. We also discuss how the analysis could be extended to other types of basis functions.

### 4.1 Univariate Dirichlet case

*m*-term approximation has the form

*L* is bounded below by a constant.

The univariate convergence result can be summarized as the following theorem which is proved in “Appendix A.2.”

### Theorem 1

*E* (independent of *m*, *x*, and \(x'\)) such that

### Remark 2

Note that we cannot simply exchange the order of the limits in the above theorem. However, the theorem does ensure the convergence of the approximation in the joint limit \(m,L \rightarrow \infty \) provided that we add terms to the series fast enough such that \(m / L \rightarrow \infty \). That is, in this limit, the approximation \(\widetilde{k}_m(x,x')\) converges uniformly to \(k(x,x')\).

As such, the results above only ensure the convergence of the prior covariance functions. However, it turns out that this also ensures the convergence of the posterior as is summarized in the following corollary.

### Corollary 3

Because the Gaussian process regression equations only involve point-wise evaluations of the kernels, it also follows that the posterior mean and covariance functions converge uniformly to the exact solutions in the limit \(m,L \rightarrow \infty \).

### Proof

Analogous to the proof of Theorem 2.2 in Särkkä and Piché (2014). \(\square \)

### 4.2 Multivariate Cartesian Dirichlet case

*d*-dimensional input space with rectangular domain \(\varOmega = [-\,L_1,L_1] \times \cdots \times [-\,L_d,L_d]\) with Dirichlet boundary conditions. In this case, we consider a truncated \(m = {{\hat{m}}}^d\) term approximation of the form

*d*-dimensional cube \([-\widetilde{L},\widetilde{L}]^d\) and that \(L_k\)s are bounded from below.

The following result for this *d*-dimensional case is proved in “Appendix A.3.”

### Theorem 4

*E* (independent of *m*, *d*, \(\mathbf {x}\), and \(\mathbf {x}'\)) such that

### Remark 5

As in the one-dimensional case, we cannot simply exchange the order of the limits above. Furthermore, we need to add terms fast enough so that \({{\hat{m}}} / L_k \rightarrow \infty \) when \(m,L_1,\ldots ,L_d \rightarrow \infty \).

### Corollary 6

As in the one-dimensional case, the uniform convergence of the prior covariance function also implies uniform convergence of the posterior mean and covariance in the limit \(m,L_1,\ldots ,L_d \rightarrow \infty \).

### 4.3 Scaling of error with increasing \({\hat{m}}\)

*d*. The latter term in turn depends on \({\hat{m}}\) and in that sense defines the scaling of error in the number of series terms.

It is worth noting that due to Remarks 17 and 20, we could actually tighten the bound by introducing \({\hat{m}}\)-dependence to *E*, but it does not affect the order of scaling, because the dependence on the dimensionality in that term is linear. Furthermore, the latter term actually depends on the ratio \({\hat{m}} / L\) and hence there is a coupling between the number of terms and the size of the domain *L*. However, we can still get an idea of the convergence speed by fixing *L*.

Let us start by considering the case when \(S(\Vert {{{\omega }}}\Vert )\) is bounded by a reciprocal of a polynomial which is the case, for example, for the Matérn covariance function. We get the following theorem.

### Theorem 7

*D* such that \(S(||{\omega }||) \le \frac{D}{||{\omega }||^{d + a}}\) for some \(a > 0\). Then we have

*L* and *d*).

### Proof

*m*, it is enough to investigate the scaling of the term \(\int _{\frac{\pi \, \hat{m}}{2L}}^{\infty } S(r) \, r^{d-1} \, \hbox {d}r\). We now get

The result in the above theorem tells us that by selecting an appropriate differentiation order for the covariance function, we can make the convergence speed arbitrarily large. In particular, if we select \(a = d/2\), we get the Monte Carlo rate, and with \(a = d\), we get a convergence rate of \(\sim 1/m\).
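The two rates quoted above can be traced through a short worked calculation. It assumes the bound \(S(r) \le D / r^{d+a}\) of Theorem 7, a fixed *L*, and the relation \(m = {\hat{m}}^d\) from Sect. 4.2; constants are not tracked beyond the leading factor.

```latex
\int_{\frac{\pi \hat{m}}{2L}}^{\infty} S(r)\, r^{d-1}\, \mathrm{d}r
\le \int_{\frac{\pi \hat{m}}{2L}}^{\infty} \frac{D}{r^{d+a}}\, r^{d-1}\, \mathrm{d}r
= D \int_{\frac{\pi \hat{m}}{2L}}^{\infty} r^{-a-1}\, \mathrm{d}r
= \frac{D}{a} \left( \frac{2L}{\pi \hat{m}} \right)^{a}
\propto \hat{m}^{-a} = m^{-a/d}.
```

Setting \(a = d/2\) gives \(m^{-1/2}\), the Monte Carlo rate, and \(a = d\) gives \(m^{-1}\).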

### Theorem 8

*d* and *L*).

### Proof

The above theorem tells us that the convergence in the squared exponential case is faster than \(\sim 1 / m\), independently of the dimensionality *d*. It is worth noting though that the bound is not independent of the dimensionality in the sense that the constants do depend on it. Strictly speaking, the convergence rate is *h*(*d*) / *m*, for some function *h* which depends on *d*. However, as a function of *m*, this rate is independent of the dimensionality.

### 4.4 Other domains

It would also be possible to carry out a similar convergence analysis, for example, in a spherical domain. In that case the technical details become slightly more complicated, because instead of sinusoids we will have Bessel functions and the eigenvalues no longer form a uniform grid. This means that instead of Riemann integrals we need to consider weighted integrals where the distribution of the zeros of Bessel functions is explicitly accounted for. It might also be possible to use some more general theoretical results from mathematical analysis to obtain the convergence results. However, due to these technical challenges, a more general convergence proof will be developed elsewhere.

There is also a similar technical challenge in the analysis when the basis functions are formed by assuming an input density (see Sect. 2.4) instead of a bounded domain. Because explicit expressions for eigenfunctions and eigenvalues cannot be obtained in general, the elementary proof methods which we used here cannot be applied. Therefore the convergence analysis of this case is also left as a topic for future research.

## 5 Relationship to other methods

In this section, we compare our method to existing sparse GP methods from a theoretical point of view. We consider two different classes of approaches: a class of inducing input methods based on the Nyström approximation (following the interpretation of Quiñonero-Candela and Rasmussen 2005b; Bui et al. 2017) and direct spectral approximations.

### 5.1 Methods from the Nyström family

*m* inducing inputs \(\mathbf {x}_{u}\) and scaling the corresponding eigendecomposition of their corresponding covariance matrix \(\mathbf {K}_{{u},{u}}\) to match that of the actual covariance. The Nyström approximations to the *j*th eigenvalue and eigenfunction are

*j*th eigenvalue and eigenvector of \(\mathbf {K}_{{u},{u}}\). This scheme was originally introduced to the GP context by Williams and Seeger (2001). They presented a sparse scheme, where the resulting approximate prior covariance over the latent variables is \(\mathbf {K}_{{f},{u}} \mathbf {K}_{{u},{u}}^{-1} \mathbf {K}_{{u},{f}}\), which can be derived directly from Eqs. (72) to (73).
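For comparison with the Laplacian basis, the Nyström-type prior covariance \(\mathbf {K}_{{f},{u}} \mathbf {K}_{{u},{u}}^{-1} \mathbf {K}_{{u},{f}}\) can be sketched for a single pair of inputs. The squared exponential kernel, the inducing input locations in the usage note, and the function names are assumptions made for the illustration.

```python
import math


def se_kernel(a, b, sigma2=1.0, ell=1.0):
    """Squared exponential covariance between scalar inputs (assumed kernel)."""
    return sigma2 * math.exp(-0.5 * ((a - b) / ell) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def nystrom_covariance(x, x2, inducing, kernel=se_kernel):
    """Approximate prior covariance k(x, u) K_uu^{-1} k(u, x2)."""
    Kuu = [[kernel(u, v) for v in inducing] for u in inducing]
    ku_x2 = [kernel(u, x2) for u in inducing]
    w = solve(Kuu, ku_x2)
    return sum(kernel(x, u) * wi for u, wi in zip(inducing, w))
```

With inducing inputs `[-1.0, 0.0, 1.0]`, the approximation reproduces the kernel exactly whenever one of the arguments is an inducing input, whereas `nystrom_covariance(0.5, 0.5, [-1.0, 0.0, 1.0])` falls below the true prior variance of one. Unlike the Laplacian basis, the quantities here depend on the kernel hyperparameters and have to be recomputed when they change.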

*Subset of Regressors* (SOR, Smola and Bartlett 2001) method uses the Nyström approximation scheme for approximating the whole covariance function,

*Sparse Pseudo-Input GP* by Snelson and Ghahramani 2006) is also based on the Nyström approximation but contains an additional diagonal term replacing the diagonal of the approximate covariance matrix with the values from the true covariance. The corresponding prior covariance function for FIC is thus

### 5.2 Direct spectral methods

In the case that \(\varOmega \) is not compact, but covers the whole \(\mathbb {R}^d\), and when the covariance function is homogeneous, the eigenvalues defined by (77) are no longer discrete, but they can only be expressed as the spectral density \(S({\omega })\) which can be seen as a continuum of eigenvalues. The eigenfunctions become complex exponentials, that is, sines and cosines, which in turn are a subset of eigenfunctions of the Laplace operator. Against this background, what (20) essentially says is that we can approximate the Mercer expansion (76) by using the basis consisting of the Laplacian eigenfunctions \(\varphi _j(\mathbf {x}) \approx \phi _j(\mathbf {x})\) and point-wise evaluations of the spectral density at the square roots of the Laplacian eigenvalues \(\gamma _j \approx S(\sqrt{\lambda _j})\).

Another related classical connection is to the works on the relationship of spline interpolation and Gaussian process priors (Wahba 1978; Kimeldorf and Wahba 1970; Wahba 1990). In particular, it is well known (see, e.g., Wahba 1990) that spline smoothing can be seen as Gaussian process regression with a specific choice of covariance function. The relationship of the spline regularization with Laplace operators then leads to series expansion representations that are closely related to the approximations considered here.

Contours for the sparse spectrum SSGP method are visualized in Fig. 3c. Here the spectral points were chosen at random following Lázaro-Gredilla (2010). Because the basis functions are spanned using both sines and cosines, the number of spectral points was \(h=8\) in order to match the rank \(m=16\). These results agree well with those presented in Lázaro-Gredilla et al. (2010) for a one-dimensional example. For this particular set of spectral points, some directions of the contours happen to match the true values very well, while other directions are completely off. Increasing the rank from 16 to 100 would give comparable results to the other methods.
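The relation between \(h=8\) spectral points and rank \(m=16\) can be made concrete with a small sketch of a sparse spectrum feature map. It is a one-dimensional illustration, not the setup of Fig. 3: it assumes a squared exponential covariance, for which the normalized spectral density is a zero-mean Gaussian with standard deviation \(1/\ell \), and all function names are hypothetical.

```python
import math
import random


def sample_spectral_points(h, ell, rng):
    """Draw h spectral points from the normalized SE spectral density N(0, 1/ell^2)."""
    return [rng.gauss(0.0, 1.0 / ell) for _ in range(h)]


def ssgp_features(x, points, sigma2=1.0):
    """Trigonometric features: one cosine and one sine per spectral point."""
    scale = math.sqrt(sigma2 / len(points))
    feats = []
    for s in points:
        feats.append(scale * math.cos(s * x))
        feats.append(scale * math.sin(s * x))
    return feats


def ssgp_kernel(x, x2, points, sigma2=1.0):
    """Sparse spectrum approximation: inner product of the feature vectors."""
    fx = ssgp_features(x, points, sigma2)
    fx2 = ssgp_features(x2, points, sigma2)
    return sum(a * b for a, b in zip(fx, fx2))
```

Each spectral point contributes two basis functions, so eight points give a feature vector of length 16. The resulting covariance equals the average of \(\cos (s_r (x - x'))\) over the sampled points, which is a Monte Carlo estimate whose quality depends on the particular draw.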

Recently Hensman et al. (2018) presented a variational Fourier feature approximation for Gaussian processes that was derived for the Matérn class of kernels, where the approximation structure is set up by a low-rank plus diagonal structure. The key differences here are the fully diagonal (independent) structure in the \(\mathbf {K}_{u,u}\) matrix (giving rise to additional speedup) and the generality of only requiring the spectral density function to be known.

While SSGP is based on a sparse spectrum, the reduced-rank method proposed in this paper aims to make the spectrum as “full” as possible at a given rank. While SSGP can be interpreted as a Monte Carlo integral approximation, the corresponding interpretation to the proposed method would be a numerical quadrature-based integral approximation (cf. the convergence proof in “Appendix A.2”). Figure 3d shows the same contours obtained by the proposed reduced-rank method. Here the eigendecomposition of the Laplace operator has been obtained for the square \(\varOmega = [-L,L] \times [-L,L]\) with Dirichlet boundary conditions. The contours match well with the full solution toward the middle of the domain. The boundary effects drive the process to zero, which is seen as distortion near the edges.

Figure 3e shows how extending the boundaries by just 25% while keeping the number of basis functions fixed at 16 gives good results. Finally, Fig. 3f corresponds to using a disk-shaped domain instead of a rectangular one. The eigendecomposition of the Laplace operator is done in polar coordinates, and the Dirichlet boundary is visualized by a circle in the figure.

### 5.3 Structure exploiting and decomposition methods

Other methods for scalable Gaussian processes include many structure exploiting techniques that, similarly to general inducing input methods, aim to be agnostic to the choice of covariance function. They rather exploit the structure of the inputs (see Saatçi 2012, for discussion on Kronecker and Toeplitz algebra) and not the GP prior per se. Most notably, scalable kernel interpolation (SKI, Wilson and Nickisch 2015) is an inducing point method that achieves \(\mathcal {O}(n + m \log m)\) time complexity and \(\mathcal {O}(n + m)\) space complexity. Through local cubic kernel interpolation, the SKI framework is used in KISS-GP (see Wilson and Nickisch 2015, for details) which uses Kronecker and Toeplitz algebra on grids of inducing inputs to speed up inference.

The computational complexity of the SKI approach scales cubically in the input dimensionality *d*. Other recent methods (e.g., Gardner et al. 2018; Izmailov et al. 2018) have reduced the time complexity to linear in *d* as well (e.g., \(\mathcal {O}(d n + d m \log m)\)). These methods typically leverage parallelization (well suited for GPU calculations) or iterative methods.

Furthermore, general methods from numerical linear algebra for approximately solving eigenvalue and singular value problems allow for fast low-rank decompositions. These methods ignore the kernel learning perspective, but can provide useful tools in practice. For example, the pivoted Cholesky decomposition (Harbrecht et al. 2012; Bach 2013) allows constructing a low-rank approximation to an \(n \times n\) positive definite matrix in \(\mathcal {O}(n m^2)\) time. There are also methods for fast randomized singular value decompositions based on subsampled Hadamard transformations (e.g., Boutsidis and Gittens 2013), with some further details in Le et al. (2013). These methods provide speedup to the general linear algebraic problem, but ignore the well-structured nature of the specific application to Gaussian process regression with stationary prior covariance functions.

## 6 Experiments

In this section, we aim to test the convergence results of the method in practice, provide examples of the practical use of the proposed method and compare it against other methods that are typically used in a similar setting. We start with small simulated one-dimensional datasets and then provide more extensive comparisons by using real-world data. We also consider an example of data, where the input domain is the surface of a sphere, and conclude our comparison by using a very large dataset to demonstrate what possibilities the computational benefits open.

### 6.1 Variation of domain size

In addition to the theoretical analysis of approximation error, we provide a study of the effect of choosing the domain size. We set up an experiment where we simulate data (\(n=100\) and all results averaged over 10 independent draws) from GP priors with a squared exponential covariance function with unit hyperparameters and corrupting additive Gaussian noise with variance \(\sigma _{n}^2 = 0.1^2\). The inputs are chosen uniformly randomly in \([-\tilde{L},\tilde{L}]\) with \(\tilde{L}=1\). We study the effect of varying the boundary location \(L \in (1,10]\).

*m*. Even though the KL suggests there would be a single best choice for *L*, the practical sensitivity to the choice of *L* is low. Already for \(m=5\), the MSE in the posterior mean is \(10^{-5}\) (note that the data has unit magnitude scale) when *L* is chosen one to two length-scales from the data boundary \(\tilde{L}\).

The Kullback–Leibler divergence between the approximative and exact GP posterior by varying the boundary *L* and keeping all other parameters fixed

**a** 256 data points generated from a GP with hyperparameters \((\sigma ^2,\ell ,\sigma _{n}^2) = (1^2, 0.1, 0.2^2)\), the full GP solution, and an approximate solution with \(m=32\). **b** Negative marginal likelihood curves for the signal variance \(\sigma ^2\), length-scale \(\ell \), and noise variance \(\sigma _{n}^2\)

### 6.2 Comparison study

Standardized mean squared error (SMSE) and mean standardized log loss (MSLL) results for the toy data (\(d=1\), \(n=256\)) from Fig. 5 and the precipitation data (\(d=2\), \(n=5776\)) evaluated by 10-fold cross-validation and averaged over ten repetitions. The evaluation time includes hyperparameter learning

We compare our solution to SOR, DTC, VAR and FIC using the implementations provided in the GPstuff software package (version 4.3.1, see Vanhatalo et al. 2013) for Mathworks Matlab. The sparse spectrum SSGP method (Lázaro-Gredilla et al. 2010) was implemented into the GPstuff toolbox for the comparisons.^{1} The reference implementation was modified such that also non-ARD covariances could be accounted for.

The *m* inducing inputs for SOR, DTC, VAR, and FIC were chosen at random as a subset from the training data and kept fixed between the methods. For low-dimensional inputs, this tends to lead to good results and avoid overfitting to the training data, while optimizing the input locations alongside hyperparameters becomes the preferred approach in high input dimensions (Quiñonero-Candela and Rasmussen 2005b). The results are averaged over ten repetitions in order to present the average performance of the methods. In Sects. 6.2 and 6.3, we used a Cartesian domain with Dirichlet boundary conditions for the new reduced-rank method. To avoid boundary effects, the domain was extended by 10% outside the inputs in each direction.

In the comparisons, we followed the guidelines given by Chalupka et al. (2013) for making comparisons between the actual performance of different methods. For hyperparameter optimization, we used the fminunc routine in Matlab with a Quasi-Newton optimizer. We also tested several other algorithms, but the results were not sensitive to the choice of optimizer. The optimizer was run with a termination tolerance of \(10^{-5}\) on the target function value and on the optimizer inputs. The number of required target function evaluations stayed fairly constant for all the comparisons, making the comparisons for the hyperparameter learning bespoke.

Figure 5a shows the approximate GP solution. The mean estimate follows the exact GP mean, and the shaded region showing the 95% confidence area differs from the exact solution (dashed) only near the boundaries.

Interpolation of the yearly precipitation levels using reduced-rank GP regression. **a** The \(n=5776\) weather station locations. **b**, **c** The results for the full GP model and the new reduced-rank GP method

Figure 6a, b shows the SMSE and MSLL values for \(m=8,10,\ldots ,32\) inducing inputs and basis functions for the toy dataset from Fig. 5. The proposed reduced-rank method converges quickly: as soon as the number of eigenfunctions is large enough (\(m=20\)) to account for the short length scales, the approximation converges to the exact full GP solution (shown by the dashed line).

In this case, the SOR method that uses the Nyström approximation to directly approximate the spectrum of the full GP (see Sect. 5) seems to give good results. However, as the resulting approximation in SOR corresponds to a singular Gaussian distribution, the predictive variance is underestimated. This can be seen in Fig. 6b, where SOR seems to give better results than the full GP. These results are, however, due to the smaller predictive variance on the test set. DTC tries to fix this shortcoming of SOR (the two are identical in all respects except predictive variance evaluation), and while SOR and DTC give identical results in terms of SMSE, they differ in MSLL. We also note that the additional trace term in the marginal likelihood in VAR makes the likelihood surface flat, which explains the differences in the results in comparison to DTC.

The sparse spectrum SSGP method did not perform well on average. Still, it can be seen that it converges toward the performance of the full GP. The dependence on the number of spectral points differs from the rest of the methods, and a rank of \(m=32\) is not enough to match the other methods. However, in terms of best case performance over the ten repetitions with different inducing inputs and spectral points, both FIC and SSGP outperformed SOR, DTC, and VAR. Because of its “dense spectrum” approach, the proposed reduced-rank method is not sensitive to the choice of spectral points, and thus, the performance remained the same between repetitions. In terms of variance over the 10-fold cross-validation folds, the methods are listed in order of growing variance in the figure legend (the variance approximately doubling between FULL and SSGP).
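For reference, the two error measures used in this comparison can be sketched as follows. The exact definitions are not restated in this section, so the code assumes the conventions of Rasmussen and Williams (2006): SMSE normalizes the mean squared error by the variance of the test targets, and MSLL subtracts the log loss of a trivial Gaussian predictor fitted to the training targets. The function names are hypothetical, and `var_pred` is assumed to include the noise variance.

```python
import numpy as np

def smse(y_test, mu_pred):
    """Standardized mean squared error (1.0 for a constant mean predictor)."""
    return np.mean((y_test - mu_pred) ** 2) / np.var(y_test)

def msll(y_test, mu_pred, var_pred, y_train):
    """Mean standardized log loss (0.0 for the trivial predictor).

    Negative log predictive density under the model minus that of a
    Gaussian with the training mean and variance, averaged over test points.
    """
    nlpd_model = (0.5 * np.log(2.0 * np.pi * var_pred)
                  + (y_test - mu_pred) ** 2 / (2.0 * var_pred))
    m0, v0 = np.mean(y_train), np.var(y_train)
    nlpd_trivial = (0.5 * np.log(2.0 * np.pi * v0)
                    + (y_test - m0) ** 2 / (2.0 * v0))
    return np.mean(nlpd_model - nlpd_trivial)
```

Because SMSE depends only on the predictive mean while MSLL also depends on the predictive variance, two methods with identical means (such as SOR and DTC above) can share an SMSE value and still differ in MSLL.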

### 6.3 Precipitation data

As a real-data example, we consider a precipitation data set that contains US annual precipitation summaries for the year 1995 (\(d=2\) and \(n=5776\), available online, see Vanhatalo et al. 2013). The observation locations are shown on a map in Fig. 7a.

We limit the number of inducing inputs and spectral points to \(m=128,192,\ldots ,512\). For our Hilbert-GP method, we additionally consider ranks \(m = 1024,1536, \ldots , 4096\), and show that this causes a computational burden of the same order as the conventional sparse GP methods with smaller values of *m*. To avoid boundary effects, the domain was extended by 10% outside the inputs in each direction.

In order to demonstrate the computational benefits of the proposed model, we also present the running time of the GP inference (including hyperparameter optimization). All methods were implemented under a similar framework in the GPstuff package, and they all employ similar reformulations for numerical stability. The key difference in the evaluation times comes from hyperparameter optimization, where SOR, DTC, VAR, FIC, and SSGP scale as \(\mathcal {O}(nm^2)\) for each evaluation of the marginal likelihood. The proposed reduced-rank method scales as \(\mathcal {O}(m^3)\) for each evaluation (after an initial cost of \(\mathcal {O}(nm^2)\)).
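The cost split described above can be illustrated with a short NumPy sketch. It assumes a one-dimensional squared exponential covariance, a basis matrix `Phi` with square-root eigenvalues `sqrt_lam` already available, and the matrix \(\mathbf {Z} = \sigma _n^2 \varvec{\Lambda }^{-1} + \varvec{\Phi }^\mathsf {T}\varvec{\Phi }\) obtained from the matrix inversion lemma. All function names are hypothetical, and this is not the GPstuff implementation used in the experiments.

```python
import numpy as np

def se_spectral_density(omega, sigma_f2, ell):
    """Spectral density of the 1-D squared exponential covariance."""
    return sigma_f2 * np.sqrt(2.0 * np.pi) * ell * np.exp(-0.5 * (ell * omega) ** 2)

def precompute(Phi, y):
    """One-off O(n m^2) work; independent of the hyperparameters."""
    return Phi.T @ Phi, Phi.T @ y, y @ y, len(y)

def neg_log_marginal_likelihood(stats, sqrt_lam, sigma_f2, ell, sigma_n2):
    """O(m^3) evaluation using only the precomputed statistics."""
    PhiTPhi, PhiTy, yTy, n = stats
    m = len(sqrt_lam)
    S = se_spectral_density(sqrt_lam, sigma_f2, ell)   # approximate eigenvalues
    Z = PhiTPhi + np.diag(sigma_n2 / S)                # m-by-m matrix
    C = np.linalg.cholesky(Z)                          # Z = C C^T
    v = np.linalg.solve(C, PhiTy)                      # v^T v = b^T Z^{-1} b
    logdet_Z = 2.0 * np.sum(np.log(np.diag(C)))
    # log|Phi Lambda Phi^T + sigma_n2 I| via the matrix determinant lemma
    logdet_Q = (n - m) * np.log(sigma_n2) + logdet_Z + np.sum(np.log(S))
    # y^T Q^{-1} y via the matrix inversion lemma
    quad = (yTy - v @ v) / sigma_n2
    return 0.5 * (logdet_Q + quad + n * np.log(2.0 * np.pi))
```

Only `precompute` touches the *n* data points. An optimizer can then call `neg_log_marginal_likelihood` repeatedly with new hyperparameter values without revisiting the data, which is where the \(\mathcal {O}(m^3)\) per-evaluation cost comes from. For very large *m* or long length scales the spectral density values can underflow, so a practical implementation would work with log-densities.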

Figure 6c, d shows the SMSE and MSLL results for this data against evaluation time. On this scale, we note that the evaluation time and accuracy, both in terms of SMSE and MSLL, are alike for SOR, DTC, VAR, and FIC. SSGP is faster to evaluate in comparison with the Nyström family of methods, which comes from the simpler structure of the approximation. Still, the number of required spectral points to meet a certain average error level is larger for SSGP.

The results for the proposed reduced-rank method (Hilbert-GP) show that with two input dimensions, the required number of basis functions is larger. For the first seven points, we notice that even though the evaluation is two orders of magnitude faster, the method performs only slightly worse in comparison with conventional sparse methods. By considering higher ranks (the next seven points), our method converges to the performance of the full GP (both in SMSE and MSLL), while retaining a computational time comparable to the conventional methods. Spatial medium-size GP regression problems of this type can thus be solved in seconds.

**a** Modeling of the yearly mean temperature on the spherical surface of the Earth (\(n = 11{,}028\)). **b** The standard deviation contours which match well with the continents

### 6.4 Temperature data on the surface of the globe

We also demonstrate the use of the method in non-Cartesian coordinates. We consider modeling of the spatial mean temperature over a number of \(n = 11{,}028\) locations around the globe.^{2}

As demonstrated earlier in Fig. 2, we use the Laplace operator in spherical coordinates as defined in (31). The eigenfunctions for the angular part are Laplace’s spherical harmonics. The evaluation of the approximation does not depend on the coordinate system, and thus, all the equations presented in the earlier sections remain valid. We use the squared exponential covariance function and \(m = 1089\) basis functions.
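The basis size can be related to the spherical harmonic degree with a small sketch. The text does not state the truncation rule, so the following assumes that all harmonics up to a maximum degree are kept and that the sphere has unit radius; under that assumption, \(m = 1089 = 33^2\) corresponds to degrees 0 through 32. The function names are hypothetical.

```python
def num_spherical_harmonics(max_degree):
    """Number of harmonics with degree 0..max_degree (2l + 1 per degree l)."""
    return sum(2 * l + 1 for l in range(max_degree + 1))

def sphere_laplacian_eigenvalues(max_degree):
    """Eigenvalues l(l + 1) of the negative Laplacian on the unit sphere,
    each repeated 2l + 1 times (once per harmonic of degree l)."""
    return [l * (l + 1) for l in range(max_degree + 1) for _ in range(2 * l + 1)]

m = num_spherical_harmonics(32)  # (32 + 1)^2 basis functions
```

The repeated eigenvalues would then play the role of \(\lambda _j\) when evaluating the spectral density, in the same way as on Cartesian domains.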

Figure 8 visualizes the modeling outcome. The results are visualized using an interrupted projection (an adaptation of the Goode homolosine projection) in order to preserve the length-scale structure across the map. Figure 8a shows the posterior mean temperature. The uncertainty is visualized in Fig. 8b, which corresponds to the \(n = 11{,}028\) observation locations that are mostly spread over the continents and Western countries (the white areas in Fig. 8b contain no observations). Obtaining the reduced-rank result (including initialization and hyperparameter learning) took approximately 50 s on a laptop computer (MacBook Air, Late 2010 model, 2.13 GHz, 4 GB RAM); relative to the evaluation time in the previous section, this scales with *n*.

### 6.5 Additive modeling of airline delays

In order to fully use the computational benefits and also underline a way of applying the method to high-dimensional inputs, we consider a large dataset for predicting airline delays. The US flight delay prediction example (originally considered by Hensman et al. 2013) is a standard test data set in Gaussian process regression. This is due to the clearly non-stationary behavior and its massive size, with nearly 6 million records.

We aim to replicate and extend the results previously presented in the work by Hensman et al. (2018) for the Variational Fourier Features (VFF) method. This example has also been used by Deisenroth and Ng (2015), where it was solved using distributed Gaussian processes, and by Samo and Roberts (2016), who use this example for demonstrating the computational efficiency of string Gaussian processes. Adam et al. (2016) used this dataset as an example where the model can be formed by the addition of multiple underlying components.

The data consists of flight arrival and departure times for every commercial flight in the USA for the year 2008. We use the standard eight covariates \(\mathbf {x}\) (see Hensman et al. 2013) which are the age of the aircraft (number of years since deployment), route distance, airtime, departure time, arrival time, day of the week, day of the month, and month. The target is to predict the delay of the aircraft at landing (in minutes), *y*.

We consider several subset sizes of the data, each selected uniformly at random: \(n = 10{,}000\), 100,000, 1,000,000, and 5,929,413 (all data). In each case, two-thirds of the data is used for training and one third for testing. For each subset size, the training is repeated ten times. The random splits are exactly the same as in Hensman et al. (2018).

Table 1 shows the (normalized) predictive mean squared errors (MSEs) and the negative log predictive densities (NLPDs) with one standard deviation on the airline arrival delays experiment. The table shows that the Hilbert-GP method is directly on par with the variational Fourier features (VFF) method. For the smaller subsets, some variability in the results is visible, even though the MSEs and NLPDs are within one standard deviation of one another for VFF and Hilbert-GP. For the datasets in the millions, VFF and Hilbert-GP perform practically equally well. Further analysis and interpretation of the data and model can be found in Hensman et al. (2018). We have omitted reporting results for the String GP method (Samo and Roberts 2016), the Bayesian committee machine (BCM, Tresp 2000), and the robust Bayesian committee machine (rBCM, Deisenroth and Ng 2015). Each of these performed worse than any of the included methods, and the resulting numbers can be found listed in Hensman et al. (2018) and Samo and Roberts (2016).

Predictive mean squared errors (MSEs) and negative log predictive densities (NLPDs) with one standard deviation on the airline arrival delays experiment (input dimensionality \(d=8\)) for a number of data points ranging up to almost 6 million

| Method | MSE (10,000) | NLPD (10,000) | MSE (100,000) | NLPD (100,000) | MSE (1,000,000) | NLPD (1,000,000) | MSE (5,929,413) | NLPD (5,929,413) |
|---|---|---|---|---|---|---|---|---|
| Hilbert-GP | 0.97 ± 0.14 | 1.404 ± 0.071 | 0.80 ± 0.06 | 1.311 ± 0.038 | 0.83 ± 0.02 | 1.329 ± 0.011 | 0.827 ± 0.005 | 1.324 ± 0.003 |
| VFF | 0.89 ± 0.15 | 1.362 ± 0.091 | 0.82 ± 0.05 | 1.319 ± 0.030 | 0.83 ± 0.01 | 1.326 ± 0.008 | 0.827 ± 0.004 | 1.324 ± 0.003 |
| SVIGP | 0.89 ± 0.16 | 1.354 ± 0.096 | 0.79 ± 0.05 | 1.299 ± 0.033 | 0.79 ± 0.01 | 1.301 ± 0.009 | 0.791 ± 0.005 | 1.300 ± 0.003 |
| Full-RBF | 0.89 ± 0.16 | 1.349 ± 0.098 | N/A | N/A | N/A | N/A | N/A | N/A |
| Full-additive | 0.89 ± 0.16 | 1.362 ± 0.096 | N/A | N/A | N/A | N/A | N/A | N/A |

### 6.6 Gaussian process driven Poisson equation

Gaussian process inference on the Poisson equation

Figure 9 shows the result of applying the proposed method to this model with the input function shown in Fig. 9b. The true solution and the simulated measurements (with a standard deviation of 1/10) are shown in Fig. 9a. The scale \(\sigma ^2\) and length scale \(\ell \) of the SE covariance function were estimated by the maximum likelihood method, and the number of basis functions used for solving the GP regression problem was 100 (for simulation we used 255 basis functions). The estimates of the solution and the input function are shown in Fig. 9a, b, respectively. As can be seen in the figures, the estimate of the solution is very good, as can be expected from the fact that we obtain direct (although noisy) measurements from it. The estimate of the input is less accurate, but still approximates the true input well.

## 7 Conclusion and discussion

In this paper, we have proposed a novel approximation scheme for forming approximate eigendecompositions of covariance functions in terms of the Laplace operator eigenbasis and the spectral density of the covariance function. The eigenfunction decomposition of the Laplacian can easily be formed in various domains, and the eigenfunctions are independent of the choice of hyperparameters of the covariance.

An advantage of the method is that it can approximate the eigendecomposition using only the eigendecomposition of the Laplacian and the spectral density of the covariance function, both of which are closed-form expressions. This, together with having the eigenvectors in \(\varvec{\Phi }\) mutually orthogonal and independent of the hyperparameters, is the key to efficiency. It allows an implementation with a computational cost of \(\mathcal {O}(nm^2)\) (initial) and \(\mathcal {O}(m^3)\) (marginal likelihood evaluation), with negligible memory requirements.

Of the infinite number of possible basis functions, only an extremely small subset is of any relevance to the GP being approximated. In GP regression, the model functions are conditioned on a covariance function (kernel), which imposes desired properties on the solutions. We choose the basis functions such that they are as close as possible (w.r.t. the Frobenius norm) to those of the particular covariance function. Our method gives the exact eigendecomposition of a GP that has been constrained to be zero at the boundary of the domain.

*learning curve* estimation, the eigenvalues of the Gaussian process can now be directly approximated. For example, we can approximate the Opper–Vivarelli bound (Opper and Vivarelli 1999) as

However, some of these abilities come with a cost. As demonstrated throughout the paper, restraining the domain to boundary conditions introduces edge effects. These are, however, known and can be accounted for. Extrapolating with a stationary covariance function outside the training inputs only causes the predictions to revert to the prior mean and variance. Therefore, we consider the boundary effects a minor problem for practical use.

Although at first sight the method appears to have a bad (exponential) scaling with respect to the input dimensionality, as shown by the analysis in Sect. 4.3, this is not true. By increasing the differentiability order of the covariance function appropriately, we can keep the convergence rate at the level \({\sim }1/m^a\), for a given constant \(a > 0\) and with a total of *m* terms in the series, regardless of the input dimensionality. Furthermore, Theorem 8 shows that for the squared exponential covariance function the convergence rate is always better than \({\sim }1/m\), independently of the input dimensionality. Further resources related to the proposed method and implementation details in the form of code are available at https://github.com/AaltoML/hilbert-gp.

## Footnotes

1. The implementation is based on the code available from Miguel Lázaro-Gredilla: http://www.tsc.uc3m.es/~miguel/downloads.php.

2. The data are available for download from US National Climatic Data Center: http://www7.ncdc.noaa.gov/CDO/cdoselect.cmd (accessed January 3, 2014).

## Notes

### Acknowledgements

Open access funding provided by Aalto University. The authors would like to thank James Hensman and Manon Kok for feedback on an early version of this paper, Mauricio Álvarez for help in the latent force model application as well as the referees for providing valuable ideas for improving the article. This research was supported by the Academy of Finland grants 308640 and 313708. We acknowledge the computational resources provided by the Aalto Science-IT project.

## References

- Adam, V., Hensman, J., Sahani, M.: Scalable transformed additive signal decomposition by non-conjugate Gaussian process inference. In: IEEE International Workshop on Machine Learning for Signal Processing (MLSP) (2016)
- Adler, R.J.: The Geometry of Random Fields, vol. 62. SIAM, Philadelphia (1981)
- Akhiezer, N.I., Glazman, I.M.: Theory of Linear Operators in Hilbert Space. Dover, New York (1993)
- Álvarez, M.A., Luengo, D., Lawrence, N.D.: Linear latent force models using Gaussian processes. IEEE Trans. Pattern Anal. Mach. Intell. **35**(11), 2693–2705 (2013)
- Bach, F.: Sharp analysis of low-rank kernel matrix approximations. In: Proceedings of the 26th Annual Conference on Learning Theory (COLT), PMLR, Princeton, NJ, USA, Proceedings of Machine Learning Research, vol. 30, pp. 185–209 (2013)
- Baker, C.T.H.: The Numerical Treatment of Integral Equations. Clarendon Press, Oxford (1977)
- Boutsidis, C., Gittens, A.: Improved matrix algorithms via the subsampled randomized Hadamard transform. SIAM J. Matrix Anal. Appl. **34**(3), 1301–1340 (2013)
- Brooks, S., Gelman, A., Jones, G.L., Meng, X.L.: Handbook of Markov Chain Monte Carlo. Chapman & Hall, London (2011)
- Bui, T.D., Yan, J., Turner, R.E.: A unifying framework for Gaussian process pseudo-point approximations using power expectation propagation. J. Mach. Learn. Res. **18**(104), 1–72 (2017)
- Chalupka, K., Williams, C.K.I., Murray, I.: A framework for evaluating approximation methods for Gaussian process regression. J. Mach. Learn. Res. **14**, 333–350 (2013)
- Courant, R., Hilbert, D.: Methods of Mathematical Physics, vol. 1. Wiley, Hoboken (2008)
- Cramér, H., Leadbetter, M.R.: Stationary and Related Stochastic Processes: Sample Function Properties and Their Applications. Dover, Mineola (2013)
- Csató, L., Opper, M.: Sparse online Gaussian processes. Neural Comput. **14**(3), 641–668 (2002)
- Da Prato, G., Zabczyk, J.: Stochastic Equations in Infinite Dimensions, Encyclopedia of Mathematics and Its Applications, vol. 45. Cambridge University Press, Cambridge (1992)
- Deisenroth, M.P., Ng, J.W.: Distributed Gaussian processes. In: International Conference on Machine Learning (ICML), pp. 1481–1490 (2015)
- Duane, S., Kennedy, A.D., Pendleton, B.J., Roweth, D.: Hybrid Monte Carlo. Phys. Lett. B **195**(2), 216–222 (1987)
- Feller, W.: An Introduction to Probability Theory and Its Applications, vol. I, 3rd edn. Wiley, Hoboken (1968)
- Fritz, J., Neuweiler, I., Nowak, W.: Application of FFT-based algorithms for large-scale universal Kriging problems. Math. Geosci. **41**(5), 509–533 (2009)
- Gal, Y., Turner, R.: Improving the Gaussian process sparse spectrum approximation by representing uncertainty in frequency inputs. In: Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 37, pp. 655–664 (2015)
- Gardner, J., Pleiss, G., Wu, R., Weinberger, K., Wilson, A.: Product kernel interpolation for scalable Gaussian processes. In: Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), Proceedings of Machine Learning Research, PMLR, Playa Blanca, Lanzarote, Canary Islands, vol. 84, pp. 1407–1416 (2018)
- Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. Johns Hopkins University Press, Baltimore (1996)
- Harbrecht, H., Peters, M., Schneider, R.: On the low-rank approximation by the pivoted Cholesky decomposition. Appl. Numer. Math. **62**(4), 428–440 (2012)
- Hensman, J., Fusi, N., Lawrence, N.D.: Gaussian processes for big data. In: Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI 2013), pp. 282–290 (2013)
- Hensman, J., Durrande, N., Solin, A.: Variational Fourier features for Gaussian processes. J. Mach. Learn. Res. **18**(151), 1–52 (2018)
- Izmailov, P., Novikov, A., Kropotov, D.: Scalable Gaussian processes with billions of inducing inputs via tensor train decomposition. In: Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), Proceedings of Machine Learning Research, PMLR, Playa Blanca, Lanzarote, Canary Islands, vol. 84, pp. 726–735 (2018)
- Kaipio, J., Somersalo, E.: Statistical and Computational Inverse Problems. Springer, Berlin (2005)
- Kimeldorf, G.S., Wahba, G.: A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Ann. Math. Stat. **41**, 495–502 (1970)
- Lázaro-Gredilla, M.: Sparse Gaussian processes for large-scale machine learning. Ph.D. Thesis, Universidad Carlos III de Madrid, Madrid, Spain (2010)
- Lázaro-Gredilla, M., Quiñonero-Candela, J., Rasmussen, C.E., Figueiras-Vidal, A.R.: Sparse spectrum Gaussian process regression. J. Mach. Learn. Res. **11**, 1865–1881 (2010)
- Le, Q., Sarlos, T., Smola, A.: Fastfood—computing Hilbert space expansions in loglinear time. In: Proceedings of the 30th International Conference on Machine Learning (ICML), PMLR, Atlanta, Georgia, USA, Proceedings of Machine Learning Research, vol. 28, pp. 244–252 (2013)
- Lenk, P.J.: Towards a practicable Bayesian nonparametric density estimator. Biometrika **78**(3), 531–543 (1991)
- Lindgren, F., Rue, H., Lindström, J.: An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. J. R. Stat. Soc. Ser. B (Stat. Methodol.) **73**(4), 423–498 (2011)
- Liu, J.S.: Monte Carlo Strategies in Scientific Computing. Springer, New York (2001)
- Loève, M.: Probability Theory. The University Series in Higher Mathematics, 3rd edn. Van Nostrand, Princeton (1963)
- Neal, R.M.: MCMC using Hamiltonian dynamics. In: Brooks, S., Gelman, A., Jones, G.L., Meng, X.L. (eds.) Handbook of Markov Chain Monte Carlo, Chap 5. Chapman & Hall, London (2011)
- Opper, M., Vivarelli, F.: General bounds on Bayes errors for regression with Gaussian processes. Adv. Neural Inf. Process. Syst. **11**, 302–308 (1999)
- Paciorek, C.J.: Bayesian smoothing with Gaussian processes using Fourier basis functions in the spectralGP package. J. Stat. Softw. **19**(2), 1–38 (2007)
- Quiñonero-Candela, J., Rasmussen, C.E.: Analysis of some methods for reduced rank Gaussian process regression. In: Switching and Learning in Feedback Systems, Lecture Notes in Computer Science, vol. 3355, pp. 98–127. Springer (2005)
- Quiñonero-Candela, J., Rasmussen, C.E.: A unifying view of sparse approximate Gaussian process regression. J. Mach. Learn. Res. **6**, 1939–1959 (2005)
- Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. The MIT Press, Cambridge (2006)
- Saatçi, Y.: Scalable inference for structured Gaussian process models. Ph.D. Thesis, University of Cambridge, UK (2012)
- Samo, Y.L.K., Roberts, S.J.: String and membrane Gaussian processes. J. Mach. Learn. Res. **17**, 1–87 (2016)
- Särkkä, S.: Linear operators and stochastic partial differential equations in Gaussian process regression. In: Proceedings of ICANN (2011)
- Särkkä, S., Hartikainen, J.: Infinite-dimensional Kalman filtering approach to spatio-temporal Gaussian process regression. In: Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR Workshop and Conference Proceedings, vol. 22, pp. 993–1001 (2012)
- Särkkä, S., Piché, R.: On convergence and accuracy of state-space approximations of squared exponential covariance functions. In: Proceedings of MLSP, pp. 1–6 (2014)
- Särkkä, S., Solin, A., Hartikainen, J.: Spatiotemporal learning via infinite-dimensional Bayesian filtering and smoothing. IEEE Signal Process. Mag. **30**(4), 51–61 (2013)
- Seeger, M., Williams, C.K.I., Lawrence, N.D.: Fast forward selection to speed up sparse Gaussian process regression. In: Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics (AISTATS) (2003)
- Showalter, R.E.: Hilbert Space Methods in Partial Differential Equations. Dover Publications, Mineola (2010)
- Shubin, M.A.: Pseudodifferential Operators and Spectral Theory. Springer Series in Soviet Mathematics. Springer, Berlin (1987)
- Smola, A.J., Bartlett, P.: Sparse greedy Gaussian process regression. In: Advances in Neural Information Processing Systems, vol. 13 (2001)
- Snelson, E., Ghahramani, Z.: Sparse Gaussian processes using pseudo-inputs. Adv. Neural Inf. Process. Syst. **18**, 1259–1266 (2006)
- Sollich, P., Halees, A.: Learning curves for Gaussian process regression: approximations and bounds. Neural Comput. **14**(6), 1393–1428 (2002)
- Tarantola, A.: Inverse Problem Theory and Methods for Model Parameter Estimation. SIAM, Philadelphia (2004)
- Titsias, M.K.: Variational learning of inducing variables in sparse Gaussian processes. In: Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR Workshop and Conference Proceedings, vol. 5, pp. 567–574 (2009)
- Tresp, V.: A Bayesian committee machine. Neural Comput. **12**(11), 2719–2741 (2000)
- Van Trees, H.L.: Detection, Estimation, and Modulation Theory Part I. Wiley, New York (1968)
- Vanhatalo, J., Pietiläinen, V., Vehtari, A.: Approximate inference for disease mapping with sparse Gaussian processes. Stat. Med. **29**(15), 1580–1607 (2010)
- Vanhatalo, J., Riihimäki, J., Hartikainen, J., Jylänki, P., Tolvanen, V., Vehtari, A.: GPstuff: Bayesian modeling with Gaussian processes. J. Mach. Learn. Res. **14**, 1175–1179 (2013)
- Wahba, G.: Improper priors, spline smoothing and the problem of guarding against model errors in regression. J. R. Stat. Soc. Ser. B (Methodol.) **40**, 364–372 (1978)
- Wahba, G.: Spline Models for Observational Data. SIAM, Philadelphia (1990)
- Williams, C.K.I., Seeger, M.: The effect of the input density distribution on kernel-based classifiers. In: Proceedings of the 17th International Conference on Machine Learning (2000)
- Williams, C.K.I., Seeger, M.: Using the Nyström method to speed up kernel machines. In: Advances in Neural Information Processing Systems, vol. 13 (2001)
- Wilson, A., Nickisch, H.: Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In: Proceedings of the 32nd International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, PMLR, Lille, France, vol. 37, pp. 1775–1784 (2015)

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.