Hilbert Space Methods for Reduced-Rank Gaussian Process Regression

This paper proposes a novel scheme for reduced-rank Gaussian process regression. The method is based on an approximate series expansion of the covariance function in terms of an eigenfunction expansion of the Laplace operator in a compact subset of $\mathbb{R}^d$. On this approximate eigenbasis the eigenvalues of the covariance function can be expressed as simple functions of the spectral density of the Gaussian process, which allows the GP inference to be solved under a computational cost scaling as $\mathcal{O}(nm^2)$ (initial) and $\mathcal{O}(m^3)$ (hyperparameter learning) with $m$ basis functions and $n$ data points. The approach also allows for rigorous error analysis with Hilbert space theory, and we show that the approximation becomes exact when the size of the compact subset and the number of eigenfunctions go to infinity. The expansion generalizes to Hilbert spaces with an inner product which is defined as an integral over a specified input density. The method is compared to previously proposed methods theoretically and through empirical tests with simulated and real data.


Introduction
Gaussian processes (GPs, Rasmussen and Williams, 2006) are powerful tools for nonparametric Bayesian inference and learning. In GP regression the model functions f (x) are assumed to be realizations from a Gaussian random process prior with a given covariance function k(x, x ), and learning amounts to solving the posterior process given a set of noisy measurements y 1 , y 2 , . . . , y n at some given test inputs. This model is often written in the form f ∼ GP(0, k(x, x )), where ε i ∼ N (0, σ 2 n ), for i = 1, 2, . . . , n. One of the main limitations of GPs in machine learning is the computational and memory requirements that scale as O(n 3 ) and O(n 2 ) in a direct implementation. This limits the applicability of GPs when the number of training samples n grows large. The computational requirements arise because in solving the GP regression problem we need to invert the n×n Gram matrix K+σ 2 n I, where K ij = k(x i , x j ), which is an O(n 3 ) operation in general.
To overcome this problem, over the years, several schemes have been proposed. They typically reduce the storage requirements to O(nm) and complexity to O(nm 2 ), where m < n. Some early methods have been reviewed in Rasmussen and Williams (2006), and Quiñonero-Candela and Rasmussen (2005a) provide a unifying view on several methods. From a spectral point of view, several of these methods (e.g., SOR, DTC, VAR, FIC) can be interpreted as modifications to the so-called Nyström method (see Baker, 1977;Williams and Seeger, 2001), a scheme for approximating the eigenspectrum.
For stationary covariance functions the spectral density of the covariance function can be employed: in this context the spectral approach has mainly been considered in regular grids, as this allows for the use of FFT-based methods for fast solutions (see Paciorek, 2007;Fritz et al., 2009), and more recently in terms of converting GPs to state space models (Särkkä and Hartikainen, 2012;Särkkä et al., 2013). Recently,  proposed a sparse spectrum method where a randomly chosen set of spectral points span a trigonometric basis for the problem.
The methods proposed in this article fall into the class of methods called reduced-rank approximations (see, e.g., Rasmussen and Williams, 2006) which are based on approximating the Gram matrix K with a matrixK with a smaller rank m < n. This allows for the use of matrix inversion lemma (Woodbury formula) to speed up the computations. It is well-known that the optimal reduced-rank approximation of the Gram (covariance) matrix K with respect to the Frobenius norm isK = ΦΛΦ T , where Λ is a diagonal matrix of the leading m eigenvalues of K and Φ is the matrix of the corresponding orthonormal eigenvectors (Golub and Van Loan, 1996;Rasmussen and Williams, 2006). Yet, as computing the eigendecomposition is an O(n 3 ) operation, this provides no remedy as such.
In this work we propose a novel method for obtaining approximate eigendecompositions of covariance functions in terms of an eigenfunction expansion of the Laplace operator in a compact subset of R d . The method is based on interpreting the covariance function as the kernel of a pseudo-differential operator (Shubin, 1987) and approximating it using Hilbert space methods (Courant and Hilbert, 2008;Showalter, 2010). This results in a reduced-rank approximation for the covariance function. This path has not been explored in GP regression context before, although the approach is closely related to the stochastic partial differential equation based methods recently introduced to spatial statistics and GP regression (Lindgren et al., 2011;Särkkä and Hartikainen, 2012;Särkkä et al., 2013). We also show how the solution formally converges to the exact solution in well-defined conditions, and provide theoretical and experimental comparisons to existing state-of-the-art methods. This paper is structured as follows: In Section 2 we derive the approximative series expansion of the covariance functions. Section 3 is dedicated to applying the approximation scheme to GP regression and providing details of the computational benefits. We provide a detailed analysis of the convergence of the method in Section 4. Section 5 and 6 provide comparisons to existing methods, the former from a more theoretical point of view, whereas the latter contains examples and comparative evaluation on several datasets. Finally the properties of the method are summarized and discussed in Section 7.

Approximating the Covariance Function
In this section, we start by stating the assumptions and properties of the class of covariance functions that we are considering, and show how a homogenous covariance function can be considered as a pseudo-differential operator constructed as a series of Laplace operators. Then we show how the pseudo-differential operators can be approximated with Hilbert space methods on compact subsets of R d or via inner products with integrable weight functions, and discuss connections to Sturm-Liouville theory.

Spectral Densities of Homogeneous and Isotropic Gaussian Processes
In this work it is assumed that the covariance function is homogeneous (stationary), which means that the covariance function k(x, x ) is actually a function of r = x − x only. This means that the covariance structure of the model function f (x) is the same regardless of the absolute position in the input space (cf. Rasmussen and Williams, 2006). In this case the covariance function can be equivalently represented in terms of the spectral density. This results from the Bochner's theorem (see, e.g., Akhiezer and Glazman, 1993;Da Prato and Zabczyk, 1992) which states that an arbitrary positive definite function k(r) can be represented as where µ is a positive measure. If the measure µ(ω) has a density, it is called the spectral density S(ω) corresponding to the covariance function k(r). This gives rise to the Fourier duality of covariance and spectral density, which is known as the Wiener-Khintchin theorem (Rasmussen and Williams, 2006), giving the identities k(r) = 1 (2π) d S(ω) e i ω T r dω and S(ω) = k(r) e −i ω T r dr. ( From these identities it is easy to see that if the covariance function is isotropic, that is, it only depends on the Euclidean norm r such that k(r) k( r ), then the spectral density will also only depend on the norm of ω such that we can write S(ω) S( ω ). In the following we assume that the considered covariance functions are indeed isotropic, but the approach can be generalized to more general homogenous covariance functions.

The Covariance Operator As a Pseudo-Differential Operator
Associated to each covariance function k(x, x ) we can also define a covariance operator K as follows: As we show in the next section, this interpretation allows us to approximate the covariance operator using Hilbert space methods which are typically used for approximating differential and pseudo-differential operators in the context of partial differential equations (Showalter, 2010). When the covariance function is homogenous, the corresponding operator will be translation invariant thus allowing for Fourier-representation as a transfer function. This transfer function is just the spectral density of the Gaussian process.
Consider an isotropic covariance function k(x, x ) k( r ) (recall that · denotes the Euclidean norm). The spectral density of the Gaussian process and thus the transfer function corresponding to the covariance operator will now have the form S( ω ). We can formally write it as a function of ω 2 such that Assume that the spectral density S(·) and hence ψ(·) are regular enough so that the spectral density has the following polynomial expansion: Thus we also have Recall that the transfer function corresponding to the Laplacian operator ∇ 2 is − ω 2 in the sense that where F [·] denotes the Fourier transform of its argument. If we take the Fourier transform of (7), we get the following representation for the covariance operator K, which defines a pseudo-differential operator (Shubin, 1987) as a formal series of Laplacian operators: In the next section we will use this representation to form a series expansion approximation for the covariance function.

Hilbert-Space Approximation of the Covariance Operator
We will now form a Hilbert-space approximation for the pseudo-differential operator defined by (9). Let Ω ⊂ R d be a compact set, and consider the eigenvalue problem for the Laplacian operators with Dirichlet boundary conditions (we could use other boundary conditions as well): Because −∇ 2 is a positive definite Hermitian operator, the set of eigenfunctions φ j (·) is orthonormal with respect to the inner product and all the eigenvalues λ j are real and positive. The negative Laplace operator can then be assigned the formal kernel in the sense that for sufficiently (weakly) differentiable functions f in the domain Ω assuming Dirichlet boundary conditions. If we consider the formal powers of this representation, due to orthonormality of the basis, we can write the arbitrary operator power s = 1, 2, . . . of the kernel as This is again to be interpreted to mean that for regular enough functions f and in the current domain with the assumed boundary conditions. This implies that on the domain Ω, assuming the boundary conditions, we also have The left hand side is just K f via (9), on the domain with the boundary conditions, and thus by comparing to (4) and using (15) we can conclude that which is only an approximation to the covariance function due to restriction of the domain to Ω and the boundary conditions. By letting ω 2 = λ j in (7) we now obtain and substituting this into (18) then leads to the approximation where S(·) is the spectral density of the covariance function, λ j is the jth eigenvalue and φ j (·) the eigenfunction of the Laplace operator in a given domain. These expressions tend to be simple closed-form expressions.
The right hand side of (20) is very easy to evaluate, because it corresponds to evaluating the spectral density at the square roots of the eigenvalues and multiplying them with the eigenfunctions of the Laplace operator. Because the eigenvalues of the Laplacian operator are monotonically increasing with j and for bounded covariance functions the spectral density goes to zero fast with higher frequencies, we can expect to obtain a good approximation of the right hand side by retaining only a finite number of terms in the series. However, even with an infinite number of terms this is only an approximation, because we assumed a compact domain with boundary conditions. The approximation can be, though, expected to be good at the input values which are not near the boundary of Ω, where the Laplacian was taken to be zero.
As an example, Figure 1 shows Matérn covariance functions of various degrees of smoothness ν (see, e.g., Rasmussen and Williams, 2006) and approximations for different numbers of basis functions in the approximation. The basis consists of the functions φ j (x) = L −1/2 sin(πj(x + L)/(2L)) and the eigenvalues were λ j = (π j/(2L)) 2 with L = 1 and = 0.1. For the squared exponential the approximation is indistinguishable from the exact curve already at m = 12, whereas the less smooth functions require more terms.

Inner Product Point of View
Instead of considering a compact bounded set Ω, we can consider the same approximation in terms of an inner product defined by an input density (Williams and Seeger, 2000). Let the inner product be defined as where w(x) is some positive weight function such that w(x) dx < ∞. In terms of this inner product, we define the operator This operator is self-adjoint with respect to the inner product, Kf, g = f, Kg , and according to the spectral theorem there exists an orthonormal set of basis functions and positive constants, {ϕ j (x), γ j | j = 1, 2, . . .}, that satisfies the eigenvalue equation Thus k(x, x ) has the series expansion Similarly, we also have the Karhunen-Loeve expansion for a sample function f (x) with zero mean and the above covariance function: where f j s are independent zero mean Gaussian random variables with variances γ j (see, e.g., Lenk, 1991). For the negative Laplacian the corresponding definition is which implies and the operator defined by (26) can be seen to be self-adjoint. Again, there exists an orthonormal basis {φ j (x) | j = 1, 2, . . .} and positive eigenvalues λ j which satisfy the eigenvalue equation Thus the kernel of D has a series expansion similar to Equation (13) and thus an approximation can be given in the same form as in Equation (20). In this case the approximation error comes from approximating the Laplacian operator with the more smooth operator, which is closely related to assumption of an input density w(x) for the Gaussian process. However, when the weight function w(·) is close to constant in the area where the inputs points are located, the approximation is accurate.

Connection to Sturm-Liouville Theory
The presented methodology is also related to the Sturm-Liouville theory arising in the theory of partial differential equations (Courant and Hilbert, 2008). When the input x is scalar, the eigenvalue problem in Equation (23) can be written in Sturm-Liouville form as follows: The above equation can be solved for φ j (x) and λ j using numerical methods for Sturm-Liouville equations. Also note that if we select w(x) = 1 in a finite set, we obtain the (a) ν = 1 2 and = 0.5 (b) ν = 3 2 and = 0.5 (c) ν → ∞ and = 0.5 Figure 2: Approximate random draws of Gaussian processes with the Matérn covariance function on the hull of a unit sphere. The color scale and radius follow the process.
which is compatible with the results in the previous section.
We consider the case where x ∈ R d and w(x) is symmetric around the origin and thus is only a function of the norm r = x (i.e. has the form w(r)). The Laplacian in spherical coordinates is where ∆ S d−1 is the Laplace-Beltrami operator on S d−1 . Let us assume that φ j (r, ξ) = h j (r) g(ξ), where ξ denotes the angular variables. After some algebra, writing the equations into Sturm-Liouville form yields for the radial part and ∆ S d−1 g(ξ) = 0 for the angular part. The solutions to the angular part are the Laplace's spherical harmonics. Note that if we assume that we have w(r) = 1 on some area of finite radius, the first equation becomes (when d > 1): Figure 2 shows example Gaussian random field draws on a unit sphere, where the basis functions are the Laplace spherical harmonics and the covariance functions of the Matérn class with different degrees of smoothness ν. Our approximation is straight-forward to apply in any domain, where the eigendecomposition of the Laplacian can be formed.

Application of the Method to GP Regression
In this section we show how the approximation (20) can be used in Gaussian process regression. We also write down the expressions needed for hyperparameter learning and discuss the computational requirements of the methods.

Gaussian Process Regression
GP regression is usually formulated as predicting an unknown scalar output f (x * ) associated with a known input x * ∈ R d , given a training data set D = {(x i , y i ) | i = 1, 2, . . . , n}.
The model functions f are assumed to be realizations of a Gaussian random process prior and the observations corrupted by Gaussian noise: where ε i ∼ N (0, σ 2 n ). For notational simplicity the functions in the above model are a priori zero mean and the measurement errors are independent Gaussian, but the results of this paper can be easily generalized to arbitrary mean functions and dependent Gaussian errors. The direct solution to the GP regression problem (34) gives the predictions . The conditional mean and variance can be computed in closed-form as (Rasmussen and Williams, 2006) where y is a vector of the n observations. In order to avoid the n × n matrix inversion in (35), we use the approximation scheme presented in the previous section and project the process to a truncated set of m basis functions of the Laplacian as given in Equation (20) such that where f j ∼ N (0, S( λ j )). We can then form an approximate eigendecomposition of the matrix K ≈ ΦΛΦ T , where Λ is a diagonal matrix of the leading m approximate eigenvalues such that Λ jj = S( λ j ), j = 1, 2, . . . , m. Here S(·) is the spectral density of the Gaussian process and λ j the jth eigenvalue of the Laplace operator. The corresponding eigenvectors in the decomposition are given by the eigenvectors φ j (x) of the Laplacian such that Φ ij = φ j (x i ).
Using the matrix inversion lemma we rewrite (35) as follows: where φ * is an m-dimensional vector with the jth entry being φ j (x * ). Thus, when the size of the training set is higher than the number of required basis functions n > m, the use of this approximation is advantageous.

Learning the Hyperparameters
A common way to learn the hyperparameters θ of the covariance function (suppressed earlier in the notation for brevity) and the noise variance σ 2 n is by maximizing the marginal likelihood function (Rasmussen and Williams, 2006;Quiñonero-Candela and Rasmussen, 2005b). Let Q = K + σ 2 n I for the full model, then the negative log marginal likelihood and its derivatives are and they can be combined with a conjugate gradient optimizer. The problem in this case is the inversion of Q, which is an n × n matrix. And thus each step of running the optimizer is O(n 3 ). For our approximation scheme, letQ = ΦΛΦ T + σ 2 n I. Now replacing Q withQ in the above expressions gives us the following: where for the terms involving log |Q|: and for the terms involvingQ −1 : where Z = σ 2 n Λ −1 + Φ T Φ. For efficient implementation, matrix-to-matrix multiplications can be avoided in many cases, and the inversion of Z can be carried out through Cholesky factorization for numerical stability. This factorization (LL T = Z) can also be used for the term log |Z| = 2 j log L jj , and Tr Z −1 Λ −1 = j 1/(Z jj Λ jj ) can be evaluated by element-wise multiplication.
Once the marginal likelihood and its derivatives are available, it is also possible to use other methods for parameter inference such as Markov chain Monte Carlo methods (Liu, 2001;Brooks et al., 2011) including Hamiltonian Monte Carlo (HMC, Duane et al., 1987;Neal, 2011) as well as numerous others.

Discussion on the Computational Complexity
As can be noted from Equation (20), the basis functions in the reduced-rank approximation do not depend on the hyperparameters of the covariance function. Thus it is enough to calculate the product Φ T Φ only once, which means that the method has a overall asymptotic computational complexity of O(nm 2 ). After this initial cost, evaluating the marginal likelihood and the marginal likelihood gradient is an O(m 3 ) operation-which in practice comes from the Cholesky factorization of Z on each step.
If the number of observations n is so large that storing the n×m matrix Φ is not feasible, the computations of Φ T Φ can be carried out in blocks. Storing the evaluated eigenfunctions in Φ is not necessary, because the φ j (x) are closed-form expressions that can be evaluated when necessary. In practice, it might be preferable to cache the result of Φ T Φ (causing a memory requirement scaling as O(m 2 )), but this is not required.
The computational complexity of conventional sparse GP approximations typically scale as O(nm 2 ) in time for each step of evaluating the marginal likelihood. The scaling in demand of storage is O(nm). This comes from the inevitable cost of re-evaluating all results involving the basis functions on each step and storing the matrices required for doing this. This applies to all the methods that will be discussed in Section 5, with the exception of SSGP, where the storage demand can be relaxed by re-evaluating the basis functions on demand.
We can also consider the rather restricting, but in certain applications often encountered case, where the measurements are constrained to a regular grid. This causes the product of the orthonormal eigenfunction matrices Φ T Φ to be diagonal, avoiding the calculation of the matrix inverse altogether. This relates to the FFT-based methods for GP regression (Paciorek, 2007;Fritz et al., 2009), and the projections to the basis functions can be evaluated by fast Fourier transform in O(n log n) time complexity.

Convergence Analysis
In this section we analyze the convergence of the proposed approximation when the size of the domain Ω and the number of terms in the series grows to infinity. We start by analyzing a univariate problem in the domain Ω = [−L, L] and with Dirichlet boundary conditions and then generalize the result to d-dimensional cubes Ω = [−L 1 , We also discuss how the analysis could be extended to other types of basis functions.

Univariate Dirichlet Case
In the univariate case, the m-term approximation has the form where the eigenfunctions and eigenvalues are: The true covariance function k(x, x ) is assumed to be stationary and have a spectral density which is uniformly bounded S(ω) < ∞, has at least two bounded derivatives |S (ω)| < ∞, |S (ω)| < ∞, and has a bounded integral over the real axis We also assume that our training and test sets are constrained in the area [− L, L], where L < L, and thus we are only interested in the case x, x ∈ [− L, L]. For the purposes of analysis we also assume that L is bounded below by a constant.
The univariate convergence result can be summarized as the following theorem which is proved in Appendix A.1.
which in turn implies that uniformly Remark 4.2. Note that we cannot simply exchange the order of the limits in the above theorem. However, the theorem does ensure the convergence of the approximation in the joint limit m, L → ∞ provided that we add terms to the series fast enough such that m/L → ∞. That is, in this limit, the approximation k m (x, x ) converges uniformly to k(x, x ).
As such, the results above only ensure the convergence of the prior covariance functions. However, it turns out that this also ensures the convergence of the posterior as is summarized in the following corollary.
Corollary 4.3. Because the Gaussian process regression equations only involve pointwise evaluations of the kernels, it also follows that the posterior mean and covariance functions converge uniformly to the exact solutions in the limit m, L → ∞.

Multivariate Cartesian Dirichlet Case
In order to generalize the results from the previous section, we turn our attention to a d-dimensional inputs space with rectangular domain Ω = [−L 1 , Dirichlet boundary conditions. In this case we consider a truncated m =m d term approximation of the form with the eigenfunctions and eigenvalues The true covariance function k(x, x ) is assumed to be stationary and have a spectral density S(ω) which is two times differentiable and the derivatives are assumed to be bounded. We also assume that the single-variable integrals are finite ∞ −∞ S(ω) dω k < ∞, which in this case is equivalent to requiring that the integral over the whole space is finite R d S(ω) dω < ∞. Furthermore, we assume that the training and test sets are contained in the d-dimensional cube [− L, L] d and that L k s are bounded from below.
The following result for this d-dimensional case is proved in Appendix A.2.
Theorem 4.4. There exists a constant C d such that where L = min k L k , which in turn implies that uniformly Remark 4.5. Analogously as in the one-dimensional case we cannot simply exchange the order of the limits above. Furthermore, we need to add terms fast enough so thatm/L k → ∞ when m, L 1 , . . . , L d → ∞.
Corollary 4.6. As in the one-dimensional case, the uniform convergence of the prior covariance function also implies uniform convergence of the posterior mean and covariance in the limit m, L 1 , . . . , L d → ∞.

Other Domains
It would also be possible carry out similar convergence analysis, for example, in a spherical domain. In that case the technical details become slightly more complicated, because instead sinusoidals we will have Bessel functions and the eigenvalues no longer form a uniform grid. This means that instead of Riemann integrals we need to consider weighted integrals where the distribution of the zeros of Bessel functions is explicitly accounted for. It might also be possible to use some more general theoretical results from mathematical analysis to obtain the convergence results. However, due to these technical challenges more general convergence proof will be developed elsewhere.
There is also a similar technical challenge in the analysis when the basis functions are formed by assuming an input density (see Section 2.4) instead of a bounded domain. Because explicit expressions for eigenfunctions and eigenvalues cannot be obtained in general, the elementary proof methods which we used here cannot be applied. Therefore the convergence analysis of this case is also left as a topic for future research.

Relationship to Other Methods
In this section we compare our method to existing sparse GP methods from a theoretical point of view. We consider two different classes of approaches: a class of inducing input methods based on the Nyström approximation (following the interpretation of Quiñonero-Candela and Rasmussen, 2005a), and direct spectral approximations.

Methods from the Nyström Family
A crude but rather effective scheme for approximating the eigendecomposition of the Gram matrix is the Nyström method (see, e.g., Baker, 1977, for the integral approximation scheme). This method is based on choosing a set of m inducing inputs x u and scaling the corresponding eigendecomposition of their corresponding covariance matrix K u,u to match that of the actual covariance. The Nyström approximations to the jth eigenvalue and eigenfunction are where λ u,j and φ u,j correspond to the jth eigenvalue and eigenvector of K u,u . This scheme was originally introduced to the GP context by Williams and Seeger (2001). They presented a sparse scheme, where the resulting approximate prior covariance over the latent variables is K f,u K −1 u,u K u,f , which can be derived directly from Equations (58) and (59). As discussed by Quiñonero-Candela and Rasmussen (2005a), the Nyström method by Williams and Seeger (2001) does not correspond to a well-formed probabilistic model. However, several methods modifying the inducing point approach are widely used. The Subset of Regressors (SOR, Smola and Bartlett, 2001) method uses the Nyström approximation scheme for approximating the whole covariance function, whereas the sparse Nyström method (Williams and Seeger, 2001) only replaces the training data covariance matrix. The SOR method is in this sense a complete Nyström approximation to the full GP problem. A method in-between is the Deterministic Training Conditional (DTC, Csató and Opper, 2002;Seeger et al., 2003) method which retains the true covariance for the training data, but uses the approximate cross-covariances between training and test data. For DTC, tampering with the covariance matrix causes the result not to actually be a Gaussian process. The Variational Approximation (VAR, Titsias, 2009) method modifies the DTC method by an additional trace term in the likelihood that comes from the variational bound. The Fully Independent (Training) Conditional (FIC, Quiñonero-Candela and Rasmussen, 2005a) method (originally introduced as Sparse Pseudo-Input GP by Snelson and Ghahramani, 2006) is also based on the Nyström approximation but contains an additional diagonal term replacing the diagonal of the approximate covariance matrix with the values from the true covariance. The corresponding prior covariance function for FIC, is thus where δ i,j is the Kronecker delta. Figure 3 illustrates the effect of the approximations compared to the exact correlation structure in the GP. The dashed contours show the exact correlation contours computed for three locations with the squared exponential covariance function. Figure 3a shows the results for the FIC approximation with 16 inducing points (locations shown in the figure). It is clear that the number of inducing points or their locations are not sufficient to capture the correlation structure. For similar figures and discussion on the effects of the inducing points, see Vanhatalo et al. (2010). This behavior is not unique to SOR or FIC, but applies to all the methods from the Nyström family.

Direct Spectral Methods
The sparse spectrum GP (SSGP) method introduced by Lázaro-Gredilla et al. (2010) uses the spectral representation of the covariance function for drawing random samples from the spectrum. These samples are used for representing the GP on a trigonometric basis where the spectral points s r , r = 1, 2, . . . , h (2h = m), are sampled from the spectral density of the original stationary covariance function (following the normalization convention used in the original paper). The covariance function corresponding to the SSGP scheme is now of the form where σ 2 is the magnitude scale hyperparameter. This representation of the sparse spectrum method converges to the full GP in the limit of the number of spectral points going to infinity, and it is the preferred formulation of the method in one or two dimensions (see Lázaro-Gredilla, 2010, for discussion). We can interpret the SSGP method in (63) as a Monte Carlo approximation of the Wiener-Khintchin integral. In order to have a representative sample of the spectrum, the method typically requires the number of spectral points to be large. For high-dimensional inputs the number of required spectral points becomes overwhelming, and optimizing the spectral locations along with the hyperparameters attractive. However, as argued by , this option does not converge to the full GP and suffers from overfitting to the training data. Contours for the sparse spectrum SSGP method are visualized in Figure 3c. Here the spectral points were chosen at random following Lázaro-Gredilla (2010). Because the basis functions are spanned using both sines and cosines, the number of spectral points was h = 8 in order to match the rank m = 16. These results agree well with those presented in the Lázaro-Gredilla et al. (2010) for a one-dimensional example. For this particular set of spectral points some directions of the contours happen to match the true values very well, while other directions are completely off. Increasing the rank from 16 to 100 would give comparable results to the other methods.
While SSGP is based on a sparse spectrum, the reduced-rank method proposed in this paper aims to make the spectrum as 'full' as possible at a given rank. While SSGP can be interpreted as a Monte Carlo integral approximation, the corresponding interpretation to the proposed method would a numerical quadrature-based integral approximation (cf. the convergence proof in Appendix A.1). Figure 3d shows the same contours obtained by the proposed reduced-rank method. Here the eigendecomposition of the Laplace operator has been obtained for the square Ω = [−L, L] × [−L, L] with Dirichlet boundary conditions. The contours match well with the full solution towards the middle of the domain. The boundary effects drive the process to zero, which is seen as distortion near the edges. Figure 3e shows how extending the boundaries just by 25% and keeping the number of basis functions fixed at 16, gives good results. The last Figure 3f corresponds to using a disk shaped domain instead of the rectangular. The eigendecomposition of the Laplace operator is done in polar coordinates, and the Dirichlet boundary is visualized by a circle in the figure.

Experiments
In this section we aim to provide examples of the practical use of the proposed method, and to compare it against other methods that are typically used in a similar setting. We start with a small simulated one-dimensional dataset, and then provide more extensive comparisons by using real-world data. We also consider an example of data, where the input domain is the surface of a sphere, and conclude our comparison by using a very large dataset to demonstrate what possibilities the computational benefits open.

Experimental Setup
For assessing the performance of different methods we use 10-fold cross-validation and evaluate the following measures based on the validation set: the standardized mean squared error (SMSE) and the mean standardized log loss (MSLL), respectively defined as: n are the predictive mean and variance for test sample i = 1, 2, . . . , n * , and y * i is the actual test value. The training data variance is denoted by Var[y]. For all experiments, the values reported are averages over ten repetitions.
We compare our solution to SOR, DTC, VAR, and FIC using the implementations provided in the GPstuff software package (version 4.3.1, see Vanhatalo et al., 2013) for Mathworks Matlab. The sparse spectrum SSGP method by  was implemented into the GPstuff toolbox for the comparisons. 1 The reference implementation was modified such that also non-ARD covariances could be accounted for.
The m inducing inputs for SOR, DTC, VAR, and FIC were chosen at random as a subset from the training data and kept fixed between the methods. For low-dimensional inputs, this tends to lead to good results and avoid over-fitting to the training data, while optimizing the input locations alongside hyperparameters becomes the preferred approach in high input dimensions (Quiñonero-Candela and Rasmussen, 2005a). The results are averaged over ten repetitions in order to present the average performance of the methods. In Sections 6.2, 6.3, and 6.5, we used a Cartesian domain with Dirichlet boundary conditions for the new reduced-rank method. To avoid boundary effects, the domain was extended by 10% outside the inputs in each direction.
In the comparisons we followed the guidelines given by Chalupka et al. (2013) for making comparisons between the actual performance of different methods. For hyperparameter optimization we used the fminunc routine in Matlab with a Quasi-Newton optimizer. We also tested several other algorithms, but the results were not sensitive to the choice of optimizer. The optimizer was run with a termination tolerance of 10 −5 on the target function value and on the optimizer inputs. The number of required target function evaluations stayed fairly constant for all the comparisons, making the comparisons for the hyperparameter learning bespoke. Figure 4 shows a simulated example, where 256 data points are drawn from a Gaussian process prior with a squared exponential covariance function. We use the same parametrization as Rasmussen and Williams (2006) and denote the signal variance σ 2 , length-scale , and noise variance σ 2 n . Figure 4b shows the negative marginal log likelihood curves both for the full GP and the approximation with m = 32 basis functions. The likelihood curve approximations are almost exact and only differs from the full GP likelihood for small length-scales (roughly for values smaller than 2L/m). Figure 4a shows the approximate GP solution. The mean estimate follows the exact GP mean, and the shaded region showing the 95% confidence area differs from the exact solution (dashed) only near the boundaries.   enough (m = 20) to account for the short length-scales, the approximation converges to the exact full GP solution (shown by the dashed line).

Toy Example
In this case the SOR method that uses the Nyström approximation to directly approximate the spectrum of the full GP (see Section 5) seems to give good results. However, as the resulting approximation in SOR corresponds to a singular Gaussian distribution, the predictive variance is underestimated. This can be seen in Figure 5b, where SOR seems to give better results than the full GP. These results are however due to the smaller predictive variance on the test set. DTC tries to fix this shortcoming of SOR-they are identical in other respects except predictive variance evaluation-and while SOR and DTC give identical results in terms of SMSE, they differ in MSLL. We also note that additional trace term in the marginal likelihood in VAR makes the likelihood surface flat, which explains the differences in the results in comparison to DTC.
The sparse spectrum SSGP method did not perform well on average. Still, it can be seen that it converges towards the performance of the full GP. The dependence on the number of spectral points differs from the rest of the methods, and a rank of m = 32 is not enough to meet the other methods. However, in terms of best case performance over the ten repetitions with different inducing inputs and spectral points, both FIC and SSGP outperformed SOR, DTC, and VAR. Because of its 'dense spectrum' approach, the proposed reduced-rank method is not sensitive to the choice of spectral points, and thus the performance remained the same between repetitions. In terms of variance over the 10-fold cross-validation folds, the methods in order of growing variance in the figure legend (the variance approximately doubling between FULL and SSGP).

Precipitation Data
As a real-data example, we consider a precipitation data set that contain US annual precipitation summaries for year 1995 (d = 2 and n = 5776, available online, see Vanhatalo et al., 2013). The observation locations are shown on a map in Figure 6a. We limit the number of inducing inputs and spectral points to m = 128, 192, . . . , 512. For the new method we additionally consider ranks m = 1024, 1536, . . . , 4096, and show that this causes a computational burden of the same order as the conventional sparse GP methods with smaller ms.
In order to demonstrate the computational benefits of the proposed model, we also present the running time of the GP inference (including hyperparameter optimization). All methods were implemented under a similar framework in the GPstuff package, and they all employ similar reformulations for numerical stability. The key difference in the evaluation times comes from hyperparameter optimization, where SOR, DTC, VAR, FIC, and SSGP scale as O(nm 2 ) for each evaluation of the marginal likelihood. The proposed reduced-rank method scales as O(m 3 ) for each evaluation (after an initial cost of O(nm 2 )). Figures 5c and 5d show the SMSE and MSLL results for this data against evaluation time. On this scale we note that the evaluation time and accuracy, both in terms of SMSE and MSLL, are alike for SOR, DTC, VAR, and FIC. SSGP is faster to evaluate in comparison with the Nyström family of methods, which comes from the simpler structure of the approximation. Still, the number of required spectral points to meet a certain average error level is larger for SSGP.
The results for the proposed reduced-rank method (NEW) show that with two input dimensions, the required number of basis functions is larger. For the first seven points, Figure 7: Modeling of the yearly mean temperature on the spherical surface of the Earth (n = 11 028). Figure 7b shows the standard deviation contours which match well with the continents.
we notice that even though the evaluation is two orders of magnitude faster, the method performs only slightly worse in comparison to conventional sparse methods. By considering higher ranks (the next seven points), the new method converges to the performance of the full GP (both in SMSE and MSLL), while retaining a computational time comparable to the conventional methods. This type of spatial medium-size GP regression problems can thus be solved in seconds. Figures 6b and 6c show interpolation of the precipitation levels using a full GP model and the reduced-rank method (m = 1728), respectively. The results are practically identical, as is easy to confirm from the color surfaces. Obtaining the reduced-rank result (including initialization and hyperparameter learning) took slightly less than 30 seconds on a laptop computer (MacBook Air, Late 2010 model, 2.13 GHz, 4 GB RAM), while the full GP inference took approximately 18 minutes.

Temperature Data on the Surface of the Globe
We also demonstrate the use of the method in non-Cartesian coordinates. We consider modeling of the spatial mean temperature over a number of n = 11 028 locations around the globe. 2 As earlier demonstrated in Figure 2, we use the Laplace operator in spherical coordinates as defined in (31). The eigenfunctions for the angular part are the Laplace's spherical harmonics. The evaluation of the approximation does not depend on the coordinate system, and thus all the equations presented in the earlier sections remain valid. We use the squared exponential covariance function and m = 1 089 basis functions. Figure 7 visualizes the modeling outcome. The results are visualized using an interrupted projection (an adaption of the Goode homolosine projection) in order to preserve the length-scale structure across the map. The uncertainty is visualized in Figure 7b, which corresponds to the n = 11 028 observation locations that are mostly spread over the continents and western countries (the white areas in Figure 7b contain no observations).
Obtaining the reduced-rank result (including initialization and hyperparameter learning) took approximately 50 seconds on a laptop computer (MacBook Air, Late 2010 model, 2.13 GHz, 4 GB RAM), which scales with n in comparison to the evaluation time in the previous section.

Apartment Price Data
In order to fully use the computational benefits of the method, we consider a large dataset.
We use records of sold apartments 3 in the UK for the period of February to October 2013. The data consist of n = 102 890 records for apartments, which were cross-referenced against a postcode database to get the geographical coordinates on which the normalized logarithmic prices were regressed. The dataset is similar to that used in Hensman et al. (2013), where the records were for year 2012.
To account for both the national and regional variations in apartment prices, we used two squared exponential covariance functions with different length-scales and magnitudes. Additionally, a Gaussian noise term captures the variation that is not related to location alone. Applying the reduced-rank methodology to a sum of covariances is straight-forward, as the the kernel approximations share basis functions and only the spectra have to be summed.
To validate the results, because the full GP solution is infeasible, we used the subset of data approach as was done in Hensman et al. (2013). We solved the full GP problem by considering subsets of n = 500 and n = 1000 data points randomly chosen from the training set. For each fold in the cross-validation the results were averaged over ten choices of subsets. The rank of the reduced-rank approximation was fixed at m = 1000 in order to match with the larger of the two subsets. Table 1 shows the SMSE and MSLL values for the apartment data. The results show that the reduced rank method. The results show that the proposed method gives good results in terms of both SMSE and MSLL, and the standard deviation between the folds is also small. In this case the reduced-rank result (including initialization and hyperparameter learning) took approximately 130 seconds on a laptop computer (MacBook Air, Late 2010 model, 2.13 GHz, 4 GB RAM).

Conclusion and Discussion
In this paper we have proposed a novel approximation scheme for forming approximate eigendecompositions of covariance functions in terms of the Laplace operator eigenbasis and the spectral density of the covariance function. The eigenfunction decomposition of the Laplacian can easily be formed in various domains, and the eigenfunctions are independent of the choice of hyperparameters of the covariance.
An advantage of the method is that it has the ability to approximate the eigendecomposition using only the eigendecomposition of the Laplacian and the spectral density of the covariance function, both of which are closed-from expressions. This together with having the eigenvectors in Φ mutually orthogonal and independent of the hyperparameters, is the key to efficiency. This allows an implementation with a computational cost of O(nm 2 ) (initial) and O(m 3 ) (marginal likelihood evaluation), with negligible memory requirements.
Of the infinite number of possible basis functions only an extremely small subset are of any relevance to the GP being approximated. In GP regression the model functions are conditioned on a covariance function (kernel), which imposes desired properties on the solutions. We choose the basis functions such that they are as close as possible (w.r.t. the Frobenius norm) to those of the particular covariance function. Our method gives the exact eigendecomposition of a GP that has been constrained to be zero at the boundary of the domain.
The method allows for theoretical analysis of the error induced by the truncation of the series and the boundary effects. This is something new in this context and extremely important, for example, in medical imaging applications. The approximative eigendecomposition also opens a range of interesting possibilities for further analysis. In learning curve estimation, the eigenvalues of the Gaussian process can now be directly approximated. For example, we can approximate the Opper-Vivarelli bound (Opper and Vivarelli, 1999) as Sollich's eigenvalue based bounds (Sollich and Halees, 2002) can be approximated and analyzed in an analogous way. However, some of these abilities come with a cost. As demonstrated throughout the paper, restraining the domain to boundary conditions introduces edge effects. These are, however, known and can be accounted for. Extrapolating with a stationary covariance function outside the training inputs only causes the predictions to revert to the prior mean and variance. Therefore we consider the boundary effects a minor problem for practical use.
A more severe limitation for applicability is the 'full' nature of the spectrum. For high-dimensional inputs the required number of basis functions grows large. There is, however, a substantial call for GPs in low-dimensional problems, especially in medical imaging applications (typical number of training data points in millions) and spatial problems. Furthermore, the mathematical formulation of the method provides a foundation for future sparse methods to build upon. A step in this direction has been taken by , which has shown good results in high-dimensional input spaces.

Appendix A. Proofs of Convergence Theorems
A.1 Proof of Theorem 4.1 The Wiener-Khinchin identity and the symmetry of the spectral density allows us to write In a one-dimensional domain Ω = [−L, L] with Dirichet boundary conditions we have an m-term approximation of the form We start by showing the convergence by growing the domain and therefore first consider an approximation with an infinite number of terms m = ∞: Lemma A.1. There exists a constant D 1 such that for all That is, Proof. We can rewrite the summation in (68) as First consider the first term above in Equation (71). Let ∆ = π 2L , and thus it can be seen to have the form which can be recognized as a Riemannian sum approximation to the integral 1 π ∞ 0 S(ω) cos(ω (x− x )) dω. Because we assume that x, x ∈ [− L, L], the integrand and its derivatives are bounded and because the integral ∞ −∞ S(ω) dω < ∞, the Riemannian integral converges, and hence we conclude that for some constant D 2 . The second summation term in Equation (71) can also be interpreted as a Riemann sum if we set ∆ = π L : Because we assumed that also the second derivative of S(·) is bounded, the derivative and the Riemann sum converge (alternatively, we could analyze the sums as a Stieltjes integral with respect to a differentiable function), and hence the exists a constant D 3 such that But now because ∞ 0 2 S (ω) cos(ω (x + x )) dω < ∞, this actually implies that for some constant D 3 . For the last summation term in Equation (71) we get the interpretation which by boundedness of x and x implies for some constant D 4 . The result now follows by combining (73), (76), and (78) via the triangle inequality.
Let us now return to the original question, and consider what happens when we replace the infinite sum approximation with a finite m number of terms. We are now interested in Lemma A.2. There exists a constant D 5 such that for all x, x ∈ [− L, L] we have Proof. Because the sinusoidals are bounded by unity, we get The right-hand term can now be seen as Riemann sum approximation to the integral Our assumptions ensure that this integral converges and hence there exists a constant D 5 such that Hence by the triangle inequality we get and thus the result follows.
The above result can now easily be combined to a proof of the one-dimensional convergence theorem as follows: Proof of Theorem 4.1. The first result follows by combining Lemmas A.1 and A.2 via the triangle inequality. Because our assumptions imply that for any fixed L we have If we now take the limit L → ∞, the second result in the theorem follows.
A.2 Proof of Theorem 4.4 When x ∈ R d , the Wiener-Khinchin identity and symmetry of the spectral density imply that The m =m d term approximation now has the form k m (x, x ) =m j 1 ,...,j d =1 S π j 1 2L 1 , . . . , π j d 2L d d k=1 1 L k sin π j k (x k + L k ) 2L k sin π j k (x k + L k ) 2L k .
The triangle inequality then gives ∞ j 1 ,...,j d =1 S π j 1 2L 1 , . . . , We can now similarly bound with respect to the summations over j 2 , . . . , j d which leads to a bound of the form D 1,1 L 1 + · · · + D 1,d L d . Taking D 1 = max k D 1,k leads to the desired result. Now we can consider what happens in the finite truncation of the series. That is, we analyze the following residual sum Lemma A.4. The exists a constant D 2 such that for all x, x ∈ [− L, L] d we have where L = min k L k .
Proof. We can write the following bound ∞ j 1 ,...,j d =m+1 The summation over the index j 1 can now be interpreted as a Riemann integral approximation with ∆ = π 2L 1 giving ∞ j 1 ,...,j d =m+1 S ω 1 , π j 2 2L 2 , . . . , After repeating this for all the indexes, by forming a telescoping sum of the terms and applying the triangle inequality then gives