
Scalable computations for nonstationary Gaussian processes

  • Original Paper
  • Published in: Statistics and Computing


Nonstationary Gaussian process models can capture complex spatially varying dependence structures in spatial data. However, the large number of observations in modern datasets makes fitting such models computationally intractable with conventional dense linear algebra. In addition, derivative-free or even first-order optimization methods can be slow to converge when estimating many spatially varying parameters. We present here a computational framework that couples an algebraic block-diagonal plus low-rank covariance matrix approximation with stochastic trace estimation to facilitate the efficient use of second-order solvers for maximum likelihood estimation of Gaussian process models with many parameters. We demonstrate the effectiveness of these methods by simultaneously fitting 192 parameters in the popular nonstationary model of Paciorek and Schervish using 107,600 sea surface temperature anomaly measurements.


[Figures 1–8 appear in the full article.]

References


  • Ackerman, S., Frey, R.: MODIS Atmosphere L2 Cloud Mask Product. NASA MODIS Adaptive Processing System, Goddard Space Flight Center (2015)

  • Ambikasaran, S., O’Neil, M., Singh, K.R.: Fast symmetric factorization of hierarchical matrices with applications. arXiv preprint arXiv:1405.0223 (2014)

  • Ambikasaran, S., Foreman-Mackey, D., Greengard, L., Hogg, D.W., O’Neil, M.: Fast direct methods for Gaussian processes. IEEE Trans. Pattern Anal. Mach. Intell. 38(2), 252–265 (2015)

  • Anderes, E.B., Stein, M.L.: Local likelihood estimation for nonstationary random fields. J. Multivar. Anal. 102(3), 506–520 (2011)

  • Banerjee, S., Gelfand, A.E., Finley, A.O., Sang, H.: Gaussian predictive process models for large spatial data sets. J. Roy. Stat. Soc. Ser. B (Stat. Methodol.) 70(4), 825–848 (2008)

  • Bentley, J.L.: Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509–517 (1975)

  • Börm, S., Garcke, J.: Approximating Gaussian processes with \({\cal{H}}^2\)-matrices. In: European Conference on Machine Learning, pp. 42–53. Springer (2007)

  • Chen, J., Stein, M.L.: Linear-cost covariance functions for Gaussian random fields. J. Am. Stat. Assoc. 118, 1–18 (2021)

  • Chen, J., Avron, H., Sindhwani, V.: Hierarchically compositional kernels for scalable nonparametric learning. J. Mach. Learn. Res. 18(1), 2214–2255 (2017)

  • Cressie, N., Johannesson, G.: Fixed rank kriging for very large spatial data sets. J. Roy. Stat. Soc. Ser. B (Stat. Methodol.) 70(1), 209–226 (2008)

  • Eidsvik, J., Finley, A.O., Banerjee, S., Rue, H.: Approximate Bayesian inference for large spatial datasets using predictive process models. Comput. Stat. Data Anal. 56(6), 1362–1380 (2012)

  • Furrer, R., Genton, M.G., Nychka, D.: Covariance tapering for interpolation of large spatial datasets. J. Comput. Graph. Stat. 15(3), 502–523 (2006)

  • Geoga, C.J., Marin, O., Schanen, M., Stein, M.L.: Fitting Matérn smoothness parameters using automatic differentiation. arXiv preprint arXiv:2201.00090 (2022)

  • Geoga, C.J., Anitescu, M., Stein, M.L.: Scalable Gaussian process computations using hierarchical matrices. J. Comput. Graph. Stat. 29, 1–11 (2019)

  • Guinness, J.: Gaussian process learning via Fisher scoring of Vecchia’s approximation. Stat. Comput. 31(3), 1–8 (2021)

  • Huang, H., Blake, L.R., Katzfuss, M., Hammerling, D.M.: Nonstationary spatial modeling of massive global satellite data. arXiv preprint arXiv:2111.13428 (2021)

  • Hutchinson, M.F.: A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Commun. Stat. Simul. Comput. 18(3), 1059–1076 (1989)

  • Katzfuss, M.: A multi-resolution approximation for massive spatial datasets. J. Am. Stat. Assoc. 112(517), 201–214 (2017)

  • Katzfuss, M., Cressie, N.: Bayesian hierarchical spatio-temporal smoothing for very large datasets. Environmetrics 23(1), 94–107 (2012)

  • Katzfuss, M., Gong, W.: A class of multi-resolution approximations for large spatial datasets. Stat. Sin. 30(4), 2203–2226 (2020)

  • Katzfuss, M., Guinness, J.: A general framework for Vecchia approximations of Gaussian processes. Stat. Sci. 36(1), 124–141 (2021)

  • Khellah, F., Fieguth, P., Murray, M.J., Allen, M.: Statistical processing of large image sequences. IEEE Trans. Image Process. 14(1), 80–93 (2004)

  • Li, Y., Sun, Y.: Efficient estimation of nonstationary spatial covariance functions with application to high-resolution climate model emulation. Stat. Sin. 29(3), 1209–1231 (2019)

  • Lindgren, F., Rue, H., Lindström, J.: An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. J. Roy. Stat. Soc. Ser. B (Stat. Methodol.) 73(4), 423–498 (2011)

  • Litvinenko, A., Sun, Y., Genton, M.G., Keyes, D.E.: Likelihood approximation with hierarchical matrices for large spatial datasets. Comput. Stat. Data Anal. 137, 115–132 (2019)

  • Maturi, E., Harris, A., Mittaz, J., Sapper, J., Wick, G., Zhu, X., Dash, P., Koner, P.: A new high-resolution sea surface temperature blended analysis. Bull. Am. Meteor. Soc. 98(5), 1015–1026 (2017)

  • Minden, V., Damle, A., Ho, K.L., Ying, L.: Fast spatial Gaussian process maximum likelihood estimation via skeletonization factorizations. Multiscale Model. Simul. 15(4), 1584–1611 (2017)

  • Neal, R.M.: MCMC using Hamiltonian dynamics. In: Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC (2011)

  • Paciorek, C.J., Schervish, M.J.: Spatial modelling using a new class of nonstationary covariance functions. Environmetrics 17(5), 483–506 (2006)

  • Risser, M.D., Calder, C.A.: Local likelihood estimation for covariance functions with spatially-varying parameters: the convoSPAT package for R. arXiv preprint arXiv:1507.08613 (2015)

  • Sang, H., Huang, J.Z.: A full scale approximation of covariance functions for large spatial data sets. J. Roy. Stat. Soc. Ser. B (Stat. Methodol.) 74(1), 111–132 (2012)

  • Sang, H., Jun, M., Huang, J.Z.: Covariance approximation for large multivariate spatial data sets with an application to multiple climate model errors. Ann. Appl. Stat. 5(4), 2519–2548 (2011)

  • Snelson, E., Ghahramani, Z.: Local and global sparse Gaussian process approximations. In: Artificial Intelligence and Statistics, pp. 524–531 (2007)

  • Solin, A., Särkkä, S.: Hilbert space methods for reduced-rank Gaussian process regression. Stat. Comput. 30(2), 419–446 (2020)

  • Stein, M.L., Chen, J., Anitescu, M.: Stochastic approximation of score functions for Gaussian processes. Ann. Appl. Stat. 7(2), 1162–1191 (2013)

  • Stein, M.L.: Interpolation of Spatial Data: Some Theory for Kriging. Springer, New York (1999)

  • Stein, M.L.: Limitations on low rank approximations for covariance matrices of spatial data. Spat. Stat. 8, 1–19 (2014)

  • Stein, M.L., Chi, Z., Welty, L.J.: Approximating likelihoods for large spatial data sets. J. Roy. Stat. Soc. Ser. B (Stat. Methodol.) 66(2), 275–296 (2004)

  • Vecchia, A.V.: Estimation and model identification for continuous spatial processes. J. Roy. Stat. Soc. Ser. B (Stat. Methodol.) 50(2), 297–312 (1988)

  • Nocedal, J., Wright, S.J.: Numerical Optimization. Springer, New York (1999)

  • Zhang, H.: Inconsistent estimation and asymptotically equal interpolations in model-based geostatistics. J. Am. Stat. Assoc. 99(465), 250–261 (2004)



Acknowledgements

This material was based upon work supported by the US Department of Energy, Office of Science, Office of Advanced Scientific Computing Research (ASCR) under Contract DE-AC02-06CH11347. We would also like to thank the anonymous referees for many helpful comments, which led to significant improvements to the manuscript.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Paul G. Beckman.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.




Appendix

1.1 Computing the Hessian

To employ second-order Newton solvers instead of Fisher scoring to compute the MLE, one must compute the Hessian, whose entries are given by

$$\begin{aligned} \big [-\nabla ^2\ell (\varvec{\theta })\big ]_{jk}&= -{\mathscr {I}}_{jk} + \frac{1}{2}{{\,\textrm{tr}\,}}\Bigg [\tilde{\varvec{\Sigma }}^{-1} \bigg (\frac{\partial ^2\tilde{\varvec{\Sigma }}}{\partial \theta _j\partial \theta _k}\bigg ) \Bigg ]\nonumber \\&\quad +\varvec{y}^\top \tilde{\varvec{\Sigma }}^{-1} \bigg (\frac{\partial \tilde{\varvec{\Sigma }}}{\partial \theta _j}\bigg ) \tilde{\varvec{\Sigma }}^{-1} \bigg (\frac{\partial \tilde{\varvec{\Sigma }}}{\partial \theta _k}\bigg ) \tilde{\varvec{\Sigma }}^{-1} \varvec{y}\nonumber \\&\quad - \frac{1}{2}\varvec{y}^\top \tilde{\varvec{\Sigma }}^{-1} \bigg (\frac{\partial ^2\tilde{\varvec{\Sigma }}}{\partial \theta _j\partial \theta _k}\bigg ) \tilde{\varvec{\Sigma }}^{-1} \varvec{y}. \end{aligned}$$
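On a small problem, this entry formula can be checked directly with dense linear algebra. The following sketch (function and argument names are illustrative, not from the paper's code) evaluates one entry given the covariance, its derivatives, and the corresponding Fisher information entry; at scale one would of course exploit the structured representation of \(\tilde{\varvec{\Sigma }}\) rather than form an explicit inverse:

```python
import numpy as np

def neg_hessian_entry(Sigma, dSj, dSk, d2Sjk, y, fisher_jk):
    """One entry of the negative log-likelihood Hessian, evaluated by
    dense linear algebra following the formula above. Sigma is the
    covariance matrix, dSj/dSk its first derivatives in theta_j and
    theta_k, d2Sjk the mixed second derivative, y the data vector, and
    fisher_jk the (j, k) Fisher information entry."""
    Si = np.linalg.inv(Sigma)   # small-n only; use factorizations at scale
    a = Si @ y                  # Sigma^{-1} y
    return (-fisher_jk
            + 0.5 * np.trace(Si @ d2Sjk)
            + a @ dSj @ Si @ dSk @ a
            - 0.5 * a @ d2Sjk @ a)
```

For a covariance that is linear in the parameters (so the second-derivative matrices vanish), a finite-difference cross derivative of the negative log-likelihood recovers the same value.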

Applying basic matrix differentiation rules to equation (19) once more yields the second derivatives of the Nyström approximation. The second derivative of the rank-p Nyström approximation

$$\begin{aligned}&\frac{\partial ^2}{\partial \theta _j\partial \theta _k}\Big (\varvec{\Sigma }_{QP}\varvec{\Sigma }_{PP}^{-1}\varvec{\Sigma }_{QP}^\top \Big )\nonumber \\&\quad = \bigg (\frac{\partial ^2\varvec{\Sigma }_{QP}}{\partial \theta _j\partial \theta _k}\bigg )\varvec{\Sigma }_{PP}^{-1}\varvec{\Sigma }_{QP}^\top \nonumber \\&\qquad - \bigg (\frac{\partial \varvec{\Sigma }_{QP}}{\partial \theta _j}\bigg )\varvec{\Sigma }_{PP}^{-1}\bigg (\frac{\partial \varvec{\Sigma }_{PP}}{\partial \theta _k}\bigg )\varvec{\Sigma }_{PP}^{-1}\varvec{\Sigma }_{QP}^\top \nonumber \\&\qquad + \bigg (\frac{\partial \varvec{\Sigma }_{QP}}{\partial \theta _j}\bigg )\varvec{\Sigma }_{PP}^{-1}\bigg (\frac{\partial \varvec{\Sigma }_{QP}^\top }{\partial \theta _k}\bigg ) \nonumber \\&\qquad - \bigg (\frac{\partial \varvec{\Sigma }_{QP}}{\partial \theta _k}\bigg )\varvec{\Sigma }_{PP}^{-1}\bigg (\frac{\partial \varvec{\Sigma }_{PP}}{\partial \theta _j}\bigg )\varvec{\Sigma }_{PP}^{-1}\varvec{\Sigma }_{QP}^\top \nonumber \\&\qquad + \varvec{\Sigma }_{QP}\varvec{\Sigma }_{PP}^{-1}\bigg (\frac{\partial \varvec{\Sigma }_{PP}}{\partial \theta _k}\bigg )\varvec{\Sigma }_{PP}^{-1}\bigg (\frac{\partial \varvec{\Sigma }_{PP}}{\partial \theta _j}\bigg )\varvec{\Sigma }_{PP}^{-1}\varvec{\Sigma }_{QP}^\top \nonumber \\&\qquad - \varvec{\Sigma }_{QP}\varvec{\Sigma }_{PP}^{-1}\bigg (\frac{\partial ^2\varvec{\Sigma }_{PP}}{\partial \theta _j\partial \theta _k}\bigg )\varvec{\Sigma }_{PP}^{-1}\varvec{\Sigma }_{QP}^\top \nonumber \\&\qquad + \varvec{\Sigma }_{QP}\varvec{\Sigma }_{PP}^{-1}\bigg (\frac{\partial \varvec{\Sigma }_{PP}}{\partial \theta _j}\bigg )\varvec{\Sigma }_{PP}^{-1}\bigg (\frac{\partial \varvec{\Sigma }_{PP}}{\partial \theta _k}\bigg )\varvec{\Sigma }_{PP}^{-1}\varvec{\Sigma }_{QP}^\top \nonumber \\&\qquad - \varvec{\Sigma }_{QP}\varvec{\Sigma }_{PP}^{-1}\bigg (\frac{\partial \varvec{\Sigma }_{PP}}{\partial \theta _j}\bigg )\varvec{\Sigma }_{PP}^{-1}\bigg (\frac{\partial 
\varvec{\Sigma }_{QP}^\top }{\partial \theta _k}\bigg ) \nonumber \\&\qquad + \bigg (\frac{\partial \varvec{\Sigma }_{QP}}{\partial \theta _k}\bigg )\varvec{\Sigma }_{PP}^{-1}\bigg (\frac{\partial \varvec{\Sigma }_{QP}}{\partial \theta _j}\bigg )^\top \nonumber \\&\qquad - \varvec{\Sigma }_{QP}\varvec{\Sigma }_{PP}^{-1}\bigg (\frac{\partial \varvec{\Sigma }_{PP}}{\partial \theta _k}\bigg )\varvec{\Sigma }_{PP}^{-1}\bigg (\frac{\partial \varvec{\Sigma }_{QP}}{\partial \theta _j}\bigg )^\top \nonumber \\&\qquad + \varvec{\Sigma }_{QP}\varvec{\Sigma }_{PP}^{-1}\bigg (\frac{\partial ^2\varvec{\Sigma }_{QP}}{\partial \theta _j\partial \theta _k}\bigg )^\top \end{aligned}$$

has rank at most 4p. The second derivative of the approximate covariance matrix \(\tilde{\varvec{\Sigma }}\) is then given by

$$\begin{aligned} \frac{\partial ^2\tilde{\varvec{\Sigma }}}{\partial \theta _j\partial \theta _k} = \varvec{\Pi }^\top \begin{bmatrix} \displaystyle \frac{\partial ^2}{\partial \theta _j\partial \theta _k}\Big (\varvec{\Sigma }_{QP}\varvec{\Sigma }_{PP}^{-1}\varvec{\Sigma }_{QP}^\top \Big ) + \frac{\partial ^2\varvec{D}}{\partial \theta _j\partial \theta _k} &{} \displaystyle \frac{\partial ^2\varvec{\Sigma }_{QP}}{\partial \theta _j\partial \theta _k} \\ \displaystyle \frac{\partial ^2\varvec{\Sigma }_{QP}^\top }{\partial \theta _j\partial \theta _k} &{} \displaystyle \frac{\partial ^2\varvec{\Sigma }_{PP}}{\partial \theta _j\partial \theta _k}\end{bmatrix} \varvec{\Pi }, \nonumber \\ \end{aligned}$$

which follows the same permuted block-diagonal plus low-rank structure as \(\tilde{\varvec{\Sigma }}\) and \(\frac{\partial \tilde{\varvec{\Sigma }}}{\partial \theta _j}\), now with rank at most 4p in the low-rank portion of the upper left block. Thus the trace term in each Hessian entry can be computed using equation (21) exactly as was done for the analogous term in the gradient, yielding a linear-complexity algorithm for computing entries of the Hessian.
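The stochastic trace estimation used throughout is, at its core, Hutchinson's estimator (Hutchinson 1989). A minimal sketch, with hypothetical names, needing only matrix-vector products with the matrix whose trace is sought, so the matrix can be applied implicitly through a block-diagonal plus low-rank representation:

```python
import numpy as np

def hutchinson_trace(matvec, n, n_probes=64, rng=None):
    """Hutchinson's stochastic trace estimator: for Rademacher probe
    vectors z, E[z^T A z] = tr(A). `matvec` applies the n-by-n matrix
    A to a vector; A never needs to be formed explicitly."""
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe
        total += z @ matvec(z)
    return total / n_probes
```

For a diagonal matrix the estimator is exact per probe (since each \(z_i^2 = 1\)); in general the variance decays as one over the number of probes.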

1.2 Nonstationary parameters for simulated data

For our approximation accuracy study in Sect. 4.3, we generate data from the Paciorek-Schervish model (6) using a continuously varying local anisotropy matrix \(\varvec{\Lambda }(\varvec{x})\) given by

$$\begin{aligned} \varvec{\Lambda }(\varvec{x})&= \begin{bmatrix}\cos (\theta (\varvec{x})) &{} -\sin (\theta (\varvec{x})) \\ \sin (\theta (\varvec{x})) &{} \cos (\theta (\varvec{x}))\end{bmatrix} \begin{bmatrix}\lambda _1(\varvec{x}) &{} \\ {} &{} \lambda _2(\varvec{x})\end{bmatrix}\nonumber \\&\quad \times \begin{bmatrix}\cos (\theta (\varvec{x})) &{} -\sin (\theta (\varvec{x})) \\ \sin (\theta (\varvec{x})) &{} \cos (\theta (\varvec{x}))\end{bmatrix}^T, \end{aligned}$$
$$\begin{aligned} \theta (\varvec{x})&= \arccos \left( \min \left( 1,\frac{\varvec{x}^T \varvec{x}_{\theta }^*}{||\varvec{x}|| \, ||\varvec{x}_{\theta }^*||}\right) \right) , \end{aligned}$$
$$\begin{aligned} \lambda _1(\varvec{x})&= \exp (2\cos (4 ||\varvec{x} - \varvec{x}_{1}^*||) - 3), \quad \text {and} \end{aligned}$$
$$\begin{aligned} \lambda _2(\varvec{x})&= \exp (2\cos (4 ||\varvec{x} - \varvec{x}_{2}^*||) - 3), \end{aligned}$$

where \(\varvec{x}_{\theta }^* = [1.75, 2.25]\), \(\varvec{x}_{1}^* = [0.75, 0.5]\), and \(\varvec{x}_{2}^* = [0.3, 0.2]\). These functions and values were chosen in an ad hoc way simply to provide sample paths with interesting features that resemble real data.
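These expressions translate directly into code. A minimal sketch (variable names are ours) that assembles \(\varvec{\Lambda }(\varvec{x})\) at a given location:

```python
import numpy as np

# reference points from the expressions above
x_theta_star = np.array([1.75, 2.25])
x1_star = np.array([0.75, 0.5])
x2_star = np.array([0.3, 0.2])

def theta(x):
    # angle between x and x*_theta; the cosine is clipped at 1 to guard
    # against floating-point values just outside arccos's domain
    c = x @ x_theta_star / (np.linalg.norm(x) * np.linalg.norm(x_theta_star))
    return np.arccos(min(1.0, c))

def lam1(x):
    return np.exp(2 * np.cos(4 * np.linalg.norm(x - x1_star)) - 3)

def lam2(x):
    return np.exp(2 * np.cos(4 * np.linalg.norm(x - x2_star)) - 3)

def Lambda(x):
    """Local anisotropy matrix: a rotation by theta(x) applied to
    diag(lambda_1(x), lambda_2(x)), transcribed from the formulas above."""
    t = theta(x)
    R = np.array([[np.cos(t), -np.sin(t)],
                  [np.sin(t),  np.cos(t)]])
    return R @ np.diag([lam1(x), lam2(x)]) @ R.T
```

By construction the result is symmetric positive definite with eigenvalues \(\lambda _1(\varvec{x})\) and \(\lambda _2(\varvec{x})\).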

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Beckman, P.G., Geoga, C.J., Stein, M.L. et al. Scalable computations for nonstationary Gaussian processes. Stat Comput 33, 84 (2023).
