Abstract
Gaussian processes (GPs) are popular models for functions, time series, and spatial fields, but direct application of GPs is computationally infeasible for large datasets. We propose a multi-scale Vecchia (MSV) approximation of GPs for modeling and analysis of multi-scale phenomena, which are ubiquitous in geophysical and other applications. In the MSV approach, increasingly large sets of variables capture increasingly small scales of spatial variation, to obtain an accurate approximation of the spatial dependence from very large to very fine scales. For a given set of observations, the MSV approach decomposes the data into different scales, which can be visualized to obtain insights into the underlying processes. We explore properties of the MSV approximation and propose an algorithm for automatic choice of the tuning parameters. We provide comparisons to existing approaches based on simulated data and using satellite measurements of land-surface temperature.
References
Ba S, Joseph VR (2012) Composite Gaussian process models for emulating expensive functions. Ann Appl Stat 6(4):1838–1860
Banerjee S, Carlin BP, Gelfand AE (2004) Hierarchical modeling and analysis for spatial data. Chapman & Hall, Cambridge
Banerjee S, Gelfand AE, Finley AO, Sang H (2008) Gaussian predictive process models for large spatial data sets. J Roy Stat Soc B 70(4):825–848
Comer ML, Delp EJ (1999) Segmentation of textured images using a multiresolution Gaussian autoregressive model. IEEE Trans Image Process 8(3):408–420
Cotton WR, Bryan G, Van den Heever SC (2010) Storm and cloud dynamics, volume 99. Academic Press
Cressie N, Johannesson G (2008) Fixed rank kriging for very large spatial data sets. J Roy Stat Soc B 70(1):209–226
Cressie N, Wikle CK (2011) Statistics for spatio-temporal data. Wiley, Hoboken, NJ
Datta A, Banerjee S, Finley AO, Gelfand AE (2016) Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. J Am Stat Assoc 111(514):800–812
Du J, Zhang H, Mandrekar VS (2009) Fixed-domain asymptotic properties of tapered maximum likelihood estimators. Ann Stat 37:3330–3361
Duvenaud D, Lloyd JR, Grosse R, Tenenbaum JB, Ghahramani Z (2013) Structure discovery in nonparametric regression through compositional kernel search. In: Proceedings of the 30th international conference on machine learning, vol 28, pp 1166–1174
Ferreira MA, Lee HK (2007) Multiscale modeling: a Bayesian perspective. Springer Science & Business Media, Berlin
Ferreira MA, West M, Lee HK, Higdon DM et al (2006) Multi-scale and hidden resolution time series models. Bayesian Anal 1(4):947–967
Furrer R, Genton MG, Nychka D (2006) Covariance tapering for interpolation of large spatial datasets. J Comput Graph Stat 15(3):502–523
Gneiting T, Katzfuss M (2014) Probabilistic forecasting. Ann Rev Stat Appl 1(1):125–151
Gotway CA, Young LJ (2002) Combining incompatible spatial data. J Am Stat Assoc 97(458):632–648
Guinness J (2018) Permutation and grouping methods for sharpening Gaussian process approximations. Technometrics 60(4):415–429
Heaton MJ, Datta A, Finley AO, Furrer R, Guinness J, Guhaniyogi R, Gerber F, Gramacy RB, Hammerling D, Katzfuss M, Lindgren F, Nychka DW, Sun F, Zammit-Mangion A (2019) A case study competition among methods for analyzing large spatial data. J Agric Biol Environ Stat 24(3):398–425
Higdon D (1998) A process-convolution approach to modelling temperatures in the North Atlantic Ocean. Environ Ecol Stat 5(2):173–190
Huang H-C, Cressie N, Gabrosek J (2002) Fast, resolution-consistent spatial prediction of global processes from satellite data. J Comput Graph Stat 11(1):63–88
Katzfuss M (2017) A multi-resolution approximation for massive spatial datasets. J Am Stat Assoc 112(517):201–214
Katzfuss M, Cressie N (2011) Spatio-temporal smoothing and EM estimation for massive remote-sensing data sets. J Time Ser Anal 32(4):430–446
Katzfuss M, Gong W (2020) A class of multi-resolution approximations for large spatial datasets. Stat Sin 30(4):2203–2226
Katzfuss M, Guinness J (2021) A general framework for Vecchia approximations of Gaussian processes. Stat Sci 36(1):124–141
Katzfuss M, Guinness J, Gong W, Zilber D (2020) Vecchia approximations of Gaussian-process predictions. J Agric Biol Environ Stat 25(3):383–414
Katzfuss M, Jurek M, Zilber D, Gong W, Guinness J, Zhang J, Schäfer F (2020b) GPvecchia: fast Gaussian-process inference using Vecchia approximations. R package version 0.1.3
Kaufman CG, Schervish MJ, Nychka DW (2008) Covariance tapering for likelihood-based estimation in large spatial data sets. J Am Stat Assoc 103(484):1545–1555
Kim S-W, Yoon S-C, Kim J, Kim S-Y (2007) Seasonal and monthly variations of columnar aerosol optical properties over East Asia determined from multi-year MODIS, lidar, and AERONET sun/sky radiometer measurements. Atmos Environ 41(8):1634–1651
Lindgren F, Rue H, Lindström J (2011) An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. J Roy Stat Soc B 73(4):423–498
Quiñonero-Candela J, Rasmussen CE (2005) A unifying view of sparse approximate Gaussian process regression. J Mach Learn Res 6:1939–1959
Rasmussen CE, Williams CKI (2006) Gaussian processes for machine learning. MIT Press, Cambridge
Sang H, Jun M, Huang JZ (2011) Covariance approximation for large multivariate spatial datasets with an application to multiple climate model errors. Ann Appl Stat 5(4):2519–2548
Saquib SS, Bouman CA, Sauer K (1996) A non-homogeneous MRF model for multiresolution Bayesian estimation. In: Proceedings of 3rd IEEE international conference on image processing, vol 2, pp 445–448. IEEE
Schäfer F, Katzfuss M, Owhadi H (2020) Sparse Cholesky factorization by Kullback-Leibler minimization. arXiv:2004.14455
Schäfer F, Sullivan TJ, Owhadi H (2017) Compression, inversion, and approximate PCA of dense kernel matrices at near-linear computational complexity. arXiv:1706.02205
Skøien JO, Blöschl G, Western A (2003) Characteristic space scales and timescales in hydrology. Water Resour Res 39(10)
Snelson E, Ghahramani Z (2007) Local and global sparse Gaussian process approximations. In: Proceedings of the 11th international conference on artificial intelligence and statistics (AISTATS)
Sobolewska MA, Siemiginowska A, Kelly BC, Nalewajko K (2014) Stochastic modeling of the Fermi/LAT \(\gamma \)-ray blazar variability. Astrophys J 786(143)
Tzeng S, Huang H-C, Cressie N (2005) A fast, optimal spatial-prediction method for massive datasets. J Am Stat Assoc 100(472):1343–1357
Vecchia A (1988) Estimation and model identification for continuous spatial processes. J Roy Stat Soc B 50(2):297–312
Wikle CK, Cressie N (1999) A dimension-reduced approach to space-time Kalman filtering. Biometrika 86(4):815–829
Wilson AG, Adams RP (2013) Gaussian process kernels for pattern discovery and extrapolation. In: Proceedings of the 30th international conference on machine learning
Wilson AG, Gilboa E, Nehorai A, Cunningham JP (2014) Fast kernel learning for multidimensional pattern extrapolation. In: Advances in neural information processing systems, pp 3626–3634
Zhu J, Morgan CL, Norman JM, Yue W, Lowery B (2004) Combined mapping of soil properties using a multi-scale tree-structured spatial model. Geoderma 118(3–4):321–334
Acknowledgements
Katzfuss’ research was partially supported by National Science Foundation (NSF) Grants DMS–1521676, DMS–1654083, and DMS–1953005. We would like to thank David Jones for helpful comments. Part of our MSV implementation was inspired by the R package GPvecchia (Katzfuss et al. 2020b), and it relies on a fast maximum–minimum distance ordering algorithm by Florian Schäfer.
Appendices
Computing \(\mathbf {U}\)
Extending the derivations in Section 2.4.1, we can specify the sparse upper-triangular matrix \(\mathbf {U}\) by the following rules:
(1) For each \(\ell = 1,2,..., L-1\), denote by \(\mathbf {U}^{(\ell )}\) the block of \(\mathbf {U}\) corresponding to level \(\ell \), with size \(n_\ell \times n_\ell \). For each \(i=1,2,...,n_\ell \), the diagonal entry is \(\{\mathbf {U}^{(\ell )}\}_{ii} = 1/\sqrt{D_i^{(\ell )}}\).
For the conditioning set of \(y^{(\ell )}_i\), suppose the s-th element in its conditioning set is \(y^{(\ell )}_{i'}\); then \(\{\mathbf {U}^{(\ell )}\}_{i'i} = -\{B_i^{(\ell )}\}^s / \sqrt{D_i^{(\ell )}}\),
where \(\{B_i^{(\ell )}\}^s\) is the s-th element of \(B_i^{(\ell )}\).
(2) For the data level L, first define the \(n \times n\) diagonal block \(\mathbf {U}^{(L)} = \mathrm {diag}\left( 1/\sqrt{D_1^{(L)}},..., 1/\sqrt{D_n^{(L)}} \right) \).
Next, for \(\ell =1,2,..., L-1\), denote by \(\mathbf {U}^{(L)(\ell )}\) the \(n_\ell \times n\) block of \(\mathbf {U}\) corresponding to \(\{ N_{z_i}^{(\ell )}, i=1,2,...,n\}\). Then, for each i, if the s-th element of \(N_{z_i}^{(\ell )}\) is \(y_{i'}^{(\ell )}\), the \((i',i)\) entry of \(\mathbf {U}^{(L)(\ell )}\) is minus the coefficient of \(y_{i'}^{(\ell )}\) in \(B_i^{(L)}\), divided by \(\sqrt{D_i^{(L)}}\).
(3) Finally, the matrix \(\mathbf {U}\) is
All entries of \(\mathbf {U}\) not specified by the rules above are zero.
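As a concrete illustration of these rules, the following sketch builds the upper-triangular factor for a single level from a covariance matrix and a list of conditioning sets. The function name `vecchia_U` and the representation of conditioning sets as index lists are illustrative choices, not from the paper; the conditional coefficients \(B_i\) and variances \(D_i\) are obtained by standard Gaussian conditioning on the covariance.

```python
import numpy as np

def vecchia_U(C, cond_sets):
    """Build the upper-triangular Vecchia factor U for one level.

    Following the rules above: U[i, i] = 1/sqrt(D_i), and for the s-th
    conditioning index i' of variable i, U[i', i] = -B_i[s]/sqrt(D_i).
    `C` is the level's covariance matrix; `cond_sets[i]` lists the
    (earlier-ordered) conditioning indices of variable i.
    """
    n = C.shape[0]
    U = np.zeros((n, n))
    for i in range(n):
        g = list(cond_sets[i])
        if g:
            # True conditional regression coefficients and residual variance
            B = np.linalg.solve(C[np.ix_(g, g)], C[g, i])
            D = C[i, i] - C[g, i] @ B
            for s, ip in enumerate(g):
                U[ip, i] = -B[s] / np.sqrt(D)
        else:
            D = C[i, i]
        U[i, i] = 1.0 / np.sqrt(D)
    return U
```

With full conditioning sets (each variable conditioning on all earlier ones), the specification is exact, so \(\mathbf {U}\mathbf {U}^\top \) recovers the inverse covariance; smaller conditioning sets yield a sparse approximation.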
Prediction at unobserved locations
For simplicity, let \(\mathcal {S}\) denote the locations corresponding to the knot set at level \(\ell \). First, we show \(\left( C^{(\ell )}(\mathcal {S}, \mathcal {S}) \right) ^{-1} =\mathbf {U}^{(\ell )}\mathbf {U}^{(\ell )\top }\). When \(\ell =1\), write \(\mathbf {U}= \left( \begin{array}{cc} \mathbf {U}_1 & \mathbf {U}_2 \\ \mathbf{0} & \mathbf {U}_3 \end{array} \right) \), where \(\mathbf {U}_1= \mathbf {U}^{(1)}\) is the block of \(\mathbf {U}\) corresponding to knot variables at level 1. Then,
Since we can also write \(\hat{\mathbf {C}}^{-1}\) as \(\hat{\mathbf {C}}^{-1}=\left( \begin{array}{cc} C^{(1)}(\mathcal {S}, \mathcal {S}) & A \\ A^\top & B \end{array} \right) ^{-1}=\left( \begin{array}{cc} E & F \\ F^\top & G \end{array} \right) \), the block-matrix inversion formula gives \(C^{(1)}(\mathcal {S}, \mathcal {S})^{-1}=E-FG^{-1}F^\top \). Thus we have
For any \(\ell >1\), similar results hold: \( C^{(\ell )}(\mathcal {S}, \mathcal {S})^{-1}= \mathbf {U}^{(\ell )}\mathbf {U}^{(\ell )\top }. \)
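The block-inverse identity used above can be checked numerically with a generic invertible upper-triangular factor standing in for \(\mathbf {U}\) (a minimal sketch; the sizes and random factor below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 6, 3                                # k plays the role of the number of knot variables
U = np.triu(rng.standard_normal((n, n)))   # any upper-triangular factor
U[np.diag_indices(n)] = np.abs(np.diag(U)) + 1.0   # ensure invertibility

Chat = np.linalg.inv(U @ U.T)              # implied approximate covariance
U1 = U[:k, :k]                             # block corresponding to the knots

# Identity from the appendix: the inverse of the knot block of Chat
# equals U1 @ U1.T, because the Schur complement cancels the U2 term.
assert np.allclose(np.linalg.inv(Chat[:k, :k]), U1 @ U1.T)
```

This holds for any invertible upper-triangular \(\mathbf {U}\), which is why the same argument applies at every level.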
When Algorithm 2 achieves a KL divergence of (almost) zero, the MSV approximation based on the knot set \(\mathbf {y}_\ell \) at level \(\ell \) is (almost) exact, and so we may assume that all information about the process at level \(\ell \) is captured by the knot set \(\mathbf {y}_\ell \). Then, the posterior mean in (5) at an unobserved location \(\mathbf {s}_0\) can be computed as
The posterior variance in (6) can be computed as
The posterior predictive variance for the entire latent process is
The first two terms can be computed similarly to (7). The last term is the variance of a linear combination of \(\mathbf {y}_{1:L-1}\), and so it can be calculated as \(var (\mathbf {H}\mathbf {y}_{1:L-1}|z) = (\mathbf {V}^{-1} \mathbf {H}^\top )^\top (\mathbf {V}^{-1} \mathbf {H}^\top )\).
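A small sketch of this last identity, assuming the posterior precision of \(\mathbf {y}_{1:L-1}\) factors as \(\mathbf {V}\mathbf {V}^\top \) with \(\mathbf {V}\) triangular (the matrices below are random stand-ins, not quantities from the paper):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
W = A @ A.T + 6 * np.eye(6)          # stand-in posterior precision of y_{1:L-1} | z
V = cholesky(W, lower=True)          # W = V V^T
H = rng.standard_normal((3, 6))      # linear combination of interest

X = solve_triangular(V, H.T, lower=True)   # X = V^{-1} H^T via a triangular solve
var_Hy = X.T @ X                           # equals H W^{-1} H^T without forming W^{-1}
assert np.allclose(var_Hy, H @ np.linalg.inv(W) @ H.T)
```

The triangular solve avoids ever forming the dense posterior covariance, which is the point of working with the sparse factor.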
Proofs
Proposition 1
For a polynomial \(y^{(\ell )}(\mathbf {s}) = \mathbf {p}(\mathbf {s})^\top \varvec{\beta }\) as a function of spatial location \(\mathbf {s}\) with p coefficients \(\varvec{\beta } \sim \mathcal {N}_p(\mathbf {0},\varvec{\Sigma }_\beta )\), the corresponding covariance function \(C^{(\ell )}(\mathbf {s}_i, \mathbf {s}_j) = \mathbf {p}(\mathbf {s}_i)^\top \varvec{\Sigma }_\beta \mathbf {p}(\mathbf {s}_j)\) can be captured exactly by choosing the knot and conditioning set to be any p distinct locations.
Proof of Proposition 1
Denote any p distinct locations by \(\{\mathbf {s}_1, \mathbf {s}_2,..., \mathbf {s}_p\}\). For the polynomial \(y(\mathbf {s}) = \mathbf {p}(\mathbf {s})^\top \varvec{\beta }\) with \(\varvec{\beta } \sim \mathcal {N}_p(\mathbf {0},\varvec{\Sigma }_\beta )\), the system of equations \(\{\mathbf {p}(\mathbf {s}_1)^\top \varvec{\beta } = y(\mathbf {s}_1) ,..., \mathbf {p}(\mathbf {s}_p)^\top \varvec{\beta } = y(\mathbf {s}_p) , \mathbf {p}(\mathbf {s})^\top \varvec{\beta } = y(\mathbf {s}) \}\) is equivalent to the system \(\{\mathbf {p}(\mathbf {s}_1)^\top \varvec{\beta } = y(\mathbf {s}_1) ,..., \mathbf {p}(\mathbf {s}_p)^\top \varvec{\beta } = y(\mathbf {s}_p) \}\); that is, \(y(\mathbf {s})\) is determined with probability one by \(y(\mathbf {s}_1),...,y(\mathbf {s}_p)\), so \(P \left( y(\mathbf {s}) | y(\mathbf {s}_1),...,y(\mathbf {s}_p) \right) = 1\). Then, the exact distribution of \(\mathbf {y}=(y(\mathbf {s}_1), y(\mathbf {s}_2), ..., y(\mathbf {s}_n))\) can be written as
which equals \(\hat{f}(\mathbf {y})\) in Vecchia by setting the knot and conditioning set to be \(\{\mathbf {s}_1, \mathbf {s}_2, ...,\mathbf {s}_p\}\). Thus, the covariance can be captured exactly. \(\square \)
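Proposition 1 can be illustrated numerically: for a quadratic basis (\(p=3\)), the conditional variance of \(y(\mathbf {s}_0)\) given the values at any three distinct knots is exactly zero. The basis, \(\varvec{\Sigma }_\beta \), and locations below are arbitrary illustrative choices:

```python
import numpy as np

def basis(s):
    # p = 3 polynomial basis functions: 1, s, s^2
    return np.array([1.0, s, s ** 2])

Sigma_beta = np.diag([1.0, 0.5, 0.25])   # prior covariance of the coefficients
knots = np.array([0.1, 0.4, 0.9])        # any p distinct locations
s0 = 0.6                                 # a new location

P = np.stack([basis(s) for s in knots])  # p x p design matrix at the knots
C_SS = P @ Sigma_beta @ P.T              # covariance among the knot values
c_0S = basis(s0) @ Sigma_beta @ P.T      # cross-covariance between s0 and the knots
c_00 = basis(s0) @ Sigma_beta @ basis(s0)

# Conditional (kriging) variance of y(s0) given the p knot values:
cond_var = c_00 - c_0S @ np.linalg.solve(C_SS, c_0S)
assert abs(cond_var) < 1e-8   # zero up to floating-point error
```

Algebraically, \(c_{0S} C_{SS}^{-1} c_{S0} = \mathbf {p}(\mathbf {s}_0)^\top \varvec{\Sigma }_\beta \mathbf {p}(\mathbf {s}_0) = c_{00}\) whenever the knot design matrix is invertible, which is exactly the degeneracy the proposition exploits.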
Proof of Theorem 1
The following proof is analogous to that of Guinness (2018, Thm. 1). Suppose the true covariance is \(\Sigma _0\) and the approximated covariance is \(\hat{\Sigma }\). At each level \(\ell \), the KL divergence between the two normal distributions can be written as \( KL \left( f(\mathbf {y}^{(\ell )}) || \hat{f}(\mathbf {y}^{(\ell )}) \right) =\frac{1}{2} E \left( -(\mathbf {y}^{(\ell )})^\top \Sigma _0^{-1}\mathbf {y}^{(\ell )} \right) +\frac{1}{2} E \left( (\mathbf {y}^{(\ell )})^\top \hat{\Sigma }^{-1}\mathbf {y}^{(\ell )}\right) +\frac{1}{2} \log \frac{ |\hat{\Sigma } |}{|\Sigma _0 |}\). Since \(\Sigma _0\) is the true covariance, the first expectation is \(E \left( -(\mathbf {y}^{(\ell )})^\top \Sigma _0^{-1}\mathbf {y}^{(\ell )} \right) = -n\). Under the MSV approximation, \(\log |\hat{\Sigma } |=\sum _{i=1}^{n} \log D_i^{(\ell )}\). Let \(L_0\) be the Cholesky factor of \(\Sigma _0\); then \(E \left( (\mathbf {y}^{(\ell )})^\top \hat{\Sigma }^{-1}\mathbf {y}^{(\ell )}\right) =tr(\mathbf {U}\mathbf {U}^\top \Sigma _0)= \sum _{i,j} (L_0^\top \mathbf {U})_{ij}^2 =n\), where the last equality holds because each column \(u_i\) of \(\mathbf {U}\) is built from the true conditional coefficients, so that \(u_i^\top \Sigma _0 u_i = D_i^{(\ell )}/D_i^{(\ell )} = 1\). Thus, the KL divergence can be written as \(KL \left( f(\mathbf {y}^{(\ell )}) || \hat{f}(\mathbf {y}^{(\ell )}) \right) =\frac{1}{2} \left( -n+n+ \sum _{i=1}^{n} \log D_i^{(\ell )}-\log |\Sigma _0 |\right) =\frac{1}{2} \sum _{i=1}^{n} \log D_i^{(\ell )}-\text {constant}\). \(\square \)
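The two key identities in this proof (the trace term equaling \(n\), and \(\log |\hat{\Sigma } |=\sum _i \log D_i^{(\ell )}\)) can be verified numerically for a Vecchia-style factor built from the true covariance. The conditioning scheme below (each variable conditions on the two previous indices) is an arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
A = rng.standard_normal((n, n))
Sigma0 = A @ A.T + n * np.eye(n)   # true covariance Sigma_0

# Vecchia-style factor with B_i, D_i computed from the true covariance.
U = np.zeros((n, n))
logD = 0.0
for i in range(n):
    g = list(range(max(0, i - 2), i))   # condition on (up to) the 2 previous indices
    if g:
        B = np.linalg.solve(Sigma0[np.ix_(g, g)], Sigma0[g, i])
        D = Sigma0[i, i] - Sigma0[g, i] @ B
        U[g, i] = -B / np.sqrt(D)
    else:
        D = Sigma0[i, i]
    U[i, i] = 1.0 / np.sqrt(D)
    logD += np.log(D)

# Trace term: tr(U U^T Sigma0) = n, since each column encodes a true conditional.
assert np.isclose(np.trace(U @ U.T @ Sigma0), n)

# log|Sigma_hat| = sum_i log D_i, since Sigma_hat^{-1} = U U^T.
_, ld = np.linalg.slogdet(U @ U.T)
assert np.isclose(-ld, logD)

# Hence KL(f || f_hat) = 0.5 * (sum_i log D_i - log|Sigma0|), which is nonnegative.
_, logdet0 = np.linalg.slogdet(Sigma0)
kl = 0.5 * (logD - logdet0)
assert kl > -1e-10
```

Minimizing \(\sum _i \log D_i^{(\ell )}\) over conditioning-set choices is therefore equivalent to minimizing the KL divergence, which is what the theorem exploits.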
Cite this article
Zhang, J., Katzfuss, M. Multi-Scale Vecchia Approximations of Gaussian Processes. JABES 27, 440–460 (2022). https://doi.org/10.1007/s13253-022-00488-0