
Multi-Scale Vecchia Approximations of Gaussian Processes


Abstract

Gaussian processes (GPs) are popular models for functions, time series, and spatial fields, but direct application of GPs is computationally infeasible for large datasets. We propose a multi-scale Vecchia (MSV) approximation of GPs for modeling and analysis of multi-scale phenomena, which are ubiquitous in geophysical and other applications. In the MSV approach, increasingly large sets of variables capture increasingly small scales of spatial variation, to obtain an accurate approximation of the spatial dependence from very large to very fine scales. For a given set of observations, the MSV approach decomposes the data into different scales, which can be visualized to obtain insights into the underlying processes. We explore properties of the MSV approximation and propose an algorithm for automatic choice of the tuning parameters. We provide comparisons to existing approaches based on simulated data and using satellite measurements of land-surface temperature.


References

  • Ba S, Joseph VR (2012) Composite Gaussian process models for emulating expensive functions. Ann Appl Stat 6(4):1838–1860

  • Banerjee S, Carlin BP, Gelfand AE (2004) Hierarchical modeling and analysis for spatial data. Chapman & Hall, Cambridge

  • Banerjee S, Gelfand AE, Finley AO, Sang H (2008) Gaussian predictive process models for large spatial data sets. J Roy Stat Soc B 70(4):825–848

  • Comer ML, Delp EJ (1999) Segmentation of textured images using a multiresolution Gaussian autoregressive model. IEEE Trans Image Process 8(3):408–420

  • Cotton WR, Bryan G, Van den Heever SC (2010) Storm and cloud dynamics, volume 99. Academic Press

  • Cressie N, Johannesson G (2008) Fixed rank kriging for very large spatial data sets. J Roy Stat Soc B 70(1):209–226

  • Cressie N, Wikle CK (2011) Statistics for spatio-temporal data. Wiley, Hoboken, NJ

  • Datta A, Banerjee S, Finley AO, Gelfand AE (2016) Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. J Am Stat Assoc 111(514):800–812

  • Du J, Zhang H, Mandrekar VS (2009) Fixed-domain asymptotic properties of tapered maximum likelihood estimators. Ann Stat 37:3330–3361

  • Duvenaud D, Lloyd JR, Grosse R, Tenenbaum JB, Ghahramani Z (2013) Structure discovery in nonparametric regression through compositional kernel search. In: Proceedings of the 30th international conference on machine learning, vol 28, pp 1166–1174

  • Ferreira MA, Lee HK (2007) Multiscale modeling: a Bayesian perspective. Springer Science & Business Media, Berlin

  • Ferreira MA, West M, Lee HK, Higdon DM et al (2006) Multi-scale and hidden resolution time series models. Bayesian Anal 1(4):947–967

  • Furrer R, Genton MG, Nychka D (2006) Covariance tapering for interpolation of large spatial datasets. J Comput Graph Stat 15(3):502–523

  • Gneiting T, Katzfuss M (2014) Probabilistic forecasting. Ann Rev Stat Appl 1(1):125–151

  • Gotway CA, Young LJ (2002) Combining incompatible spatial data. J Am Stat Assoc 97(458):632–648

  • Guinness J (2018) Permutation and grouping methods for sharpening Gaussian process approximations. Technometrics 60(4):415–429

  • Heaton MJ, Datta A, Finley AO, Furrer R, Guinness J, Guhaniyogi R, Gerber F, Gramacy RB, Hammerling D, Katzfuss M, Lindgren F, Nychka DW, Sun F, Zammit-Mangion A (2019) A case study competition among methods for analyzing large spatial data. J Agric Biol Environ Stat 24(3):398–425

  • Higdon D (1998) A process-convolution approach to modelling temperatures in the North Atlantic Ocean. Environ Ecol Stat 5(2):173–190

  • Huang H-C, Cressie N, Gabrosek J (2002) Fast, resolution-consistent spatial prediction of global processes from satellite data. J Comput Graph Stat 11(1):63–88

  • Katzfuss M (2017) A multi-resolution approximation for massive spatial datasets. J Am Stat Assoc 112(517):201–214

  • Katzfuss M, Cressie N (2011) Spatio-temporal smoothing and EM estimation for massive remote-sensing data sets. J Time Ser Anal 32(4):430–446

  • Katzfuss M, Gong W (2020) A class of multi-resolution approximations for large spatial datasets. Stat Sin 30(4):2203–2226

  • Katzfuss M, Guinness J (2021) A general framework for Vecchia approximations of Gaussian processes. Stat Sci 36(1):124–141

  • Katzfuss M, Guinness J, Gong W, Zilber D (2020) Vecchia approximations of Gaussian-process predictions. J Agric Biol Environ Stat 25(3):383–414

  • Katzfuss M, Jurek M, Zilber D, Gong W, Guinness J, Zhang J, Schäfer F (2020b) GPvecchia: fast Gaussian-process inference using Vecchia approximations. R package version 0.1.3

  • Kaufman CG, Schervish MJ, Nychka DW (2008) Covariance tapering for likelihood-based estimation in large spatial data sets. J Am Stat Assoc 103(484):1545–1555

  • Kim S-W, Yoon S-C, Kim J, Kim S-Y (2007) Seasonal and monthly variations of columnar aerosol optical properties over East Asia determined from multi-year MODIS, lidar, and AERONET sun/sky radiometer measurements. Atmos Environ 41(8):1634–1651

  • Lindgren F, Rue H, Lindström J (2011) An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. J Roy Stat Soc B 73(4):423–498

  • Quiñonero-Candela J, Rasmussen CE (2005) A unifying view of sparse approximate Gaussian process regression. J Mach Learn Res 6:1939–1959

  • Rasmussen CE, Williams CKI (2006) Gaussian processes for machine learning. MIT Press, Cambridge

  • Sang H, Jun M, Huang JZ (2011) Covariance approximation for large multivariate spatial datasets with an application to multiple climate model errors. Ann Appl Stat 5(4):2519–2548

  • Saquib SS, Bouman CA, Sauer K (1996) A non-homogeneous MRF model for multiresolution Bayesian estimation. In: Proceedings of 3rd IEEE international conference on image processing, vol 2, pp 445–448. IEEE

  • Schäfer F, Katzfuss M, Owhadi H (2020) Sparse Cholesky factorization by Kullback-Leibler minimization. arXiv:2004.14455

  • Schäfer F, Sullivan TJ, Owhadi H (2017) Compression, inversion, and approximate PCA of dense kernel matrices at near-linear computational complexity. arXiv:1706.02205

  • Skøien JO, Blöschl G, Western A (2003) Characteristic space scales and timescales in hydrology. Water Resour Res, 39(10)

  • Snelson E, Ghahramani Z (2007) Local and global sparse Gaussian process approximations. In: Proceedings of the 11th international conference on artificial intelligence and statistics (AISTATS)

  • Sobolewska MA, Siemiginowska A, Kelly BC, Nalewajko K (2014) Stochastic modeling of the Fermi/LAT \(\gamma \)-ray blazar variability. Astrophys J, 786(143)

  • Tzeng S, Huang H-C, Cressie N (2005) A fast, optimal spatial-prediction method for massive datasets. J Am Stat Assoc 100(472):1343–1357

  • Vecchia A (1988) Estimation and model identification for continuous spatial processes. J Roy Stat Soc B 50(2):297–312

  • Wikle CK, Cressie N (1999) A dimension-reduced approach to space-time Kalman filtering. Biometrika 86(4):815–829

  • Wilson AG, Adams RP (2013) Gaussian process kernels for pattern discovery and extrapolation. In: Proceedings of the 30th international conference on machine learning

  • Wilson AG, Gilboa E, Nehorai A, Cunningham JP (2014) Fast kernel learning for multidimensional pattern extrapolation. In: Advances in neural information processing systems, pp 3626–3634

  • Zhu J, Morgan CL, Norman JM, Yue W, Lowery B (2004) Combined mapping of soil properties using a multi-scale tree-structured spatial model. Geoderma 118(3–4):321–334


Acknowledgements

Katzfuss’ research was partially supported by National Science Foundation (NSF) Grants DMS–1521676, DMS–1654083, and DMS–1953005. We would like to thank David Jones for helpful comments. Part of our MSV implementation was inspired by the R package GPvecchia (Katzfuss et al. 2020b), and it relies on a fast maximum–minimum distance ordering algorithm by Florian Schäfer.


Appendices

Computing \(\mathbf {U}\)

Extending the derivations in Section 2.4.1, we can specify the sparse upper triangular matrix \(\mathbf {U}\) by the following rules:

(1) For each \(\ell = 1,2,..., L-1\), let \(\mathbf {U}^{(\ell )}\) denote the \(n_\ell \times n_\ell \) block of \(\mathbf {U}\) corresponding to level \(\ell \). For each \(i=1,2,...,n_\ell \),

$$\begin{aligned} \mathbf {U}^{(\ell )}_{ii} = (D_i^{(\ell )} )^{-1/2}. \end{aligned}$$

Suppose the s-th element of the conditioning set of \(y^{(\ell )}_i\) is \(y^{(\ell )}_{i'}\); then

$$\begin{aligned} \mathbf {U}^{(\ell )}_{i'i} = -\{B_i^{(\ell )}\}^s (D_i^{(\ell )} )^{-1/2}, \end{aligned}$$

where \(\{B_i^{(\ell )}\}^s\) is the s-th element of \(B_i^{(\ell )}\).

(2) For the data level L, first define the \(n \times n\) diagonal matrix

$$\begin{aligned} \mathbf {U}^{(L)(L)}=diag\left( (D_1^{(L)} )^{-1/2},(D_2^{(L)} )^{-1/2},..., (D_n^{(L)} )^{-1/2}\right) . \end{aligned}$$

Next, for \(\ell =1,2,..., L-1\), let \(\mathbf {U}^{(L)(\ell )}\) denote the \(n_\ell \times n\) block of \(\mathbf {U}\) corresponding to \(\{ N_{z_i}^{(\ell )}, i=1,2,...,n\}\). Then, for each i, if the s-th element of \(N_{z_i}^{(\ell )}\) is \(y_{i'}^{(\ell )}\), we set

$$\begin{aligned} \mathbf {U}^{(L)(\ell )}_{i'i} = -\{B_i^{(L)}\}^s (D_i^{(L)} )^{-1/2}. \end{aligned}$$

(3) Finally, combining these blocks, the matrix \(\mathbf {U}\) is block upper triangular, with the within-level blocks on the diagonal and the data-level blocks in the last block column:

$$\begin{aligned} \mathbf {U}= \left( \begin{array}{ccccc} \mathbf {U}^{(1)} & & & & \mathbf {U}^{(L)(1)} \\ & \mathbf {U}^{(2)} & & & \mathbf {U}^{(L)(2)} \\ & & \ddots & & \vdots \\ & & & \mathbf {U}^{(L-1)} & \mathbf {U}^{(L)(L-1)} \\ & & & & \mathbf {U}^{(L)(L)} \end{array} \right) . \end{aligned}$$

All unmentioned entries in \(\mathbf {U}\) are 0.
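To make rule (1) concrete, the following is a minimal sketch (in Python with NumPy/SciPy; function and argument names are illustrative and not taken from the paper's implementation) of assembling one within-level block \(\mathbf {U}^{(\ell )}\) from precomputed conditional regression coefficients \(B_i^{(\ell )}\), conditional variances \(D_i^{(\ell )}\), and conditioning-set indices:

```python
import numpy as np
from scipy.sparse import csc_matrix

def build_U_level(B, D, cond_sets, n_ell):
    """Assemble the within-level block U^(ell) from rule (1).

    B[i]         : regression coefficients of y_i^(ell) on its conditioning set
    D[i]         : conditional variance of y_i^(ell) given its conditioning set
    cond_sets[i] : indices i' (within level ell) of the conditioning variables
    (names and data layout are illustrative, not from the paper's code)
    """
    rows, cols, vals = [], [], []
    for i in range(n_ell):
        d_inv_sqrt = D[i] ** -0.5
        rows.append(i); cols.append(i); vals.append(d_inv_sqrt)   # U_ii = D_i^{-1/2}
        for s, i_prime in enumerate(cond_sets[i]):
            rows.append(i_prime); cols.append(i)                  # column i, row i'
            vals.append(-B[i][s] * d_inv_sqrt)                    # U_{i'i} = -{B_i}^s D_i^{-1/2}
    return csc_matrix((vals, (rows, cols)), shape=(n_ell, n_ell))
```

Rule (2) can be coded analogously, with the column index running over the \(n\) data points and the row indices given by the cross-level conditioning sets \(N_{z_i}^{(\ell )}\).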

Prediction at unobserved locations

For simplicity, let \(\mathcal {S}\) denote the locations corresponding to the knot set at level \(\ell \). First, we show that \(\left( C^{(\ell )}(\mathcal {S}, \mathcal {S}) \right) ^{-1} =\mathbf {U}^{(\ell )}\mathbf {U}^{(\ell )\top }\). When \(\ell =1\), write \(\mathbf {U}= \left( \begin{array}{cc} \mathbf {U}_1 & \mathbf {U}_2 \\ \mathbf {0} & \mathbf {U}_3 \end{array} \right) \), where \(\mathbf {U}_1= \mathbf {U}^{(1)}\) is the block of \(\mathbf {U}\) corresponding to the knot variables at level 1. Then,

$$\begin{aligned} \hat{\mathbf {C}}^{-1}=\mathbf {U} \mathbf {U}^\top =\left( \begin{array}{cc} \mathbf {U}_1 & \mathbf {U}_2 \\ \mathbf {0} & \mathbf {U}_3 \end{array} \right) \left( \begin{array}{cc} \mathbf {U}_1^\top & \mathbf {0} \\ \mathbf {U}_2^\top & \mathbf {U}_3^\top \end{array} \right) = \left( \begin{array}{cc} \mathbf {U}_1 \mathbf {U}_1^\top +\mathbf {U}_2 \mathbf {U}_2^\top & \mathbf {U}_2\mathbf {U}_3^\top \\ \mathbf {U}_3\mathbf {U}_2^\top & \mathbf {U}_3\mathbf {U}_3^\top \end{array} \right) . \end{aligned}$$

Since we can also write \(\hat{\mathbf {C}}^{-1}\) as \(\hat{\mathbf {C}}^{-1}=\left( \begin{array}{cc} C^{(1)}(\mathcal {S}, \mathcal {S}) & A \\ A^\top & B \end{array} \right) ^{-1}=\left( \begin{array}{cc} E & F \\ F^\top & G \end{array} \right) \), the block-matrix inversion formula gives \(C^{(1)}(\mathcal {S}, \mathcal {S})^{-1}=E-FG^{-1}F^\top \). Thus we have

$$\begin{aligned} C^{(1)}(\mathcal {S}, \mathcal {S})^{-1}=\mathbf {U}_1 \mathbf {U}_1^\top + \mathbf {U}_2 \mathbf {U}_2^\top -\mathbf {U}_2\mathbf {U}_3^\top (\mathbf {U}_3\mathbf {U}_3^\top )^{-1}\mathbf {U}_3\mathbf {U}_2^\top =\mathbf {U}_1 \mathbf {U}_1^\top = \mathbf {U}^{(1)}\mathbf {U}^{(1)\top }. \end{aligned}$$

For any \(\ell >1\), similar results hold: \( C^{(\ell )}(\mathcal {S}, \mathcal {S})^{-1}= \mathbf {U}^{(\ell )}\mathbf {U}^{(\ell )\top }. \)

When Algorithm 2 achieves a KL divergence of (almost) zero, the MSV approximation based on the knot set \(\mathbf {y}_\ell \) at level \(\ell \) is (almost) exact, and so we assume that all information about the process at level \(\ell \) is captured by the knot set \(\mathbf {y}_\ell \). Then, the posterior mean in (5) at an unobserved location \(\mathbf {s}_0\) can be computed as

$$\begin{aligned} E(y^{(\ell )}(\mathbf {s}_0) |\mathbf {z}) =&E\left( E(y^{(\ell )}(\mathbf {s}_0) |\mathbf {y}_\ell ,\mathbf {z}) |\mathbf {z}\right) = E\left( E(y^{(\ell )}(\mathbf {s}_0) |\mathbf {y}_\ell ) |\mathbf {z}\right) \\ =&C^{(\ell )}(\mathbf {s}_0, \mathcal {S}) \left( C^{(\ell )}(\mathcal {S}, \mathcal {S}) \right) ^{-1} E(\mathbf {y}_\ell |\mathbf {z})\\ =&C^{(\ell )}(\mathbf {s}_0, \mathcal {S}) \mathbf {U}^{(\ell )}\mathbf {U}^{(\ell )\top } \mathbf {H}_{\ell }\varvec{\mu }. \end{aligned}$$

The posterior variance in (6) can be computed as

$$\begin{aligned}&var(y^{(\ell )}(\mathbf {s}_0) |\mathbf {z}) \nonumber \\ =&E\left( var(y^{(\ell )}(\mathbf {s}_0) |\mathbf {y}_\ell ,\mathbf {z}) |\mathbf {z}\right) + var\left( E(y^{(\ell )}(\mathbf {s}_0) |\mathbf {y}_\ell ,\mathbf {z}) |\mathbf {z}\right) \nonumber \\ =&E\left( var(y^{(\ell )}(\mathbf {s}_0) |\mathbf {y}_\ell ) |\mathbf {z}\right) + var\left( E(y^{(\ell )}(\mathbf {s}_0) |\mathbf {y}_\ell ) |\mathbf {z}\right) \nonumber \\ =&E \left( C^{(\ell )}(\mathbf {s}_0, \mathbf {s}_0)- C^{(\ell )}(\mathbf {s}_0, \mathcal {S}) \left( C^{(\ell )}(\mathcal {S}, \mathcal {S}) \right) ^{-1} C^{(\ell )}(\mathcal {S},\mathbf {s}_0) |\mathbf {z}\right) \nonumber \\&+ var\left( C^{(\ell )}(\mathbf {s}_0, \mathcal {S}) \left( C^{(\ell )}(\mathcal {S}, \mathcal {S}) \right) ^{-1} \mathbf {y}_\ell |\mathbf {z}\right) \nonumber \\ =&\, C^{(\ell )}(\mathbf {s}_0, \mathbf {s}_0)- C^{(\ell )}(\mathbf {s}_0, \mathcal {S}) \left( C^{(\ell )}(\mathcal {S}, \mathcal {S}) \right) ^{-1} C^{(\ell )}(\mathcal {S},\mathbf {s}_0)\nonumber \\&+ C^{(\ell )}(\mathbf {s}_0, \mathcal {S}) \left( C^{(\ell )}(\mathcal {S}, \mathcal {S}) \right) ^{-1} var(\mathbf {y}_\ell |\mathbf {z}) \left( C^{(\ell )}(\mathcal {S}, \mathcal {S}) \right) ^{-1} C^{(\ell )}(\mathbf {s}_0, \mathcal {S}) ^\top \nonumber \\ =&\,C^{(\ell )}(\mathbf {s}_0, \mathbf {s}_0)- C^{(\ell )}(\mathbf {s}_0, \mathcal {S}) \mathbf {U}^{(\ell )}\mathbf {U}^{(\ell )\top } C^{(\ell )}(\mathcal {S},\mathbf {s}_0)\nonumber \\&+ C^{(\ell )}(\mathbf {s}_0, \mathcal {S}) \mathbf {U}^{(\ell )}\mathbf {U}^{(\ell )\top } \varvec{\Sigma }_{\mathbf {H}_{\ell }} \mathbf {U}^{(\ell )}\mathbf {U}^{(\ell )\top } C^{(\ell )}(\mathbf {s}_0, \mathcal {S}) ^\top . \end{aligned}$$
(7)
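To illustrate how (5) and (7) are evaluated in practice, here is a minimal sketch (in Python with NumPy; argument names are illustrative and not from the paper's code), assuming the factor \(\mathbf {U}^{(\ell )}\), the posterior mean \(\mathbf {H}_{\ell }\varvec{\mu }\), and the posterior covariance \(\varvec{\Sigma }_{\mathbf {H}_{\ell }}\) of the knots have already been obtained:

```python
import numpy as np

def predict_level(c0S, c00, U_ell, post_mean_knots, post_cov_knots):
    """Posterior mean (5) and variance (7) at s0 for scale ell.

    c0S             : C^(ell)(s0, S), cross-covariance vector to the level-ell knots
    c00             : C^(ell)(s0, s0), prior variance at s0
    U_ell           : factor with U_ell @ U_ell.T = (C^(ell)(S, S))^{-1}
    post_mean_knots : E(y_ell | z), i.e. H_ell mu in the text
    post_cov_knots  : var(y_ell | z), i.e. Sigma_{H_ell} in the text
    (argument names are illustrative, not from the paper's implementation)
    """
    # K = (C(S,S))^{-1} C(S, s0), computed via the sparse factor, never forming the inverse
    K = U_ell @ (U_ell.T @ c0S)
    mean = K @ post_mean_knots                      # C(s0,S) C(S,S)^{-1} E(y_ell | z)
    var = c00 - c0S @ K + K @ post_cov_knots @ K    # Eq. (7)
    return mean, var
```

Because \(\mathbf {U}^{(\ell )}\mathbf {U}^{(\ell )\top } = \left( C^{(\ell )}(\mathcal {S}, \mathcal {S}) \right) ^{-1}\), the inverse covariance never needs to be formed explicitly.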

The posterior predictive variance for the entire latent process is

$$\begin{aligned}&var(y^{(1)}(\mathbf {s}_0)+y^{(2)}(\mathbf {s}_0) |\mathbf {z}) \\&\quad = E\left( var \left( y^{(1)}(\mathbf {s}_0)+y^{(2)}(\mathbf {s}_0) |\mathbf {y}_1,\mathbf {y}_2,\mathbf {z}\right) |\mathbf {z}\right) + var\left( E \left( y^{(1)}(\mathbf {s}_0)+y^{(2)}(\mathbf {s}_0) |\mathbf {y}_1,\mathbf {y}_2,\mathbf {z}\right) |\mathbf {z}\right) \\&\quad = E\left( var \left( y^{(1)}(\mathbf {s}_0)+y^{(2)}(\mathbf {s}_0) |\mathbf {y}_1,\mathbf {y}_2 \right) |\mathbf {z}\right) + var\left( E \left( y^{(1)}(\mathbf {s}_0)+y^{(2)}(\mathbf {s}_0) |\mathbf {y}_1,\mathbf {y}_2 \right) |\mathbf {z}\right) \\&\quad = E\left( \left( var(y^{(1)}(\mathbf {s}_0)|\mathbf {y}_1) + var(y^{(2)}(\mathbf {s}_0) |\mathbf {y}_2)\right) |\mathbf {z}\right) + var\left( \left( E(y^{(1)}(\mathbf {s}_0)|\mathbf {y}_1)+ E(y^{(2)}(\mathbf {s}_0) |\mathbf {y}_2) \right) |\mathbf {z}\right) \\&\quad = E\left( var(y^{(1)}(\mathbf {s}_0)|\mathbf {y}_1) |\mathbf {z}\right) + E \left( var(y^{(2)}(\mathbf {s}_0) |\mathbf {y}_2) |\mathbf {z}\right) \\&\qquad + var\left( C^{(1)}(\mathbf {s}_0, \mathcal {S}_1) \left( C^{(1)}(\mathcal {S}_1, \mathcal {S}_1) \right) ^{-1} \mathbf {y}_1 + C^{(2)}(\mathbf {s}_0, \mathcal {S}_2) \left( C^{(2)}(\mathcal {S}_2, \mathcal {S}_2) \right) ^{-1} \mathbf {y}_2|\mathbf {z}\right) \end{aligned}$$

The first two terms can be computed similarly to (7). The last term is the posterior variance of a linear combination of \(\mathbf {y}_{1:L-1}\), and so it can be calculated as \(var (\mathbf {H}\mathbf {y}_{1:L-1}|\mathbf {z}) = (\mathbf {V}^{-1} \mathbf {H}^\top )^\top (\mathbf {V}^{-1} \mathbf {H}^\top )\).
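As a sketch of that last computation (assuming \(\mathbf {V}\) and \(\mathbf {H}\) are available as SciPy sparse matrices and that \(\mathbf {V}\) is lower triangular; the lower flag should be flipped if the factor in the main text is upper triangular):

```python
from scipy.sparse.linalg import spsolve_triangular

# var(H y_{1:L-1} | z) = (V^{-1} H^T)^T (V^{-1} H^T): solve the triangular system
# column by column rather than inverting V explicitly.
W = spsolve_triangular(V.tocsr(), H.T.toarray(), lower=True)
post_var = W.T @ W
```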

Proofs

Proposition 1

For a polynomial \(y^{(\ell )}(\mathbf {s}) = \mathbf {p}(\mathbf {s})^\top \varvec{\beta }\), viewed as a function of spatial location \(\mathbf {s}\), with \(p\) coefficients \(\varvec{\beta } \sim \mathcal {N}_p(\mathbf {0},\varvec{\Sigma }_\beta )\), the corresponding covariance function \(C^{(\ell )}(\mathbf {s}_i, \mathbf {s}_j) = \mathbf {p}(\mathbf {s}_i)^\top \varvec{\Sigma }_\beta \mathbf {p}(\mathbf {s}_j)\) can be captured exactly by setting the knot and conditioning set to be any \(p\) distinct locations.

Proof of Proposition 1

Denote any \(p\) distinct locations as \(\{\mathbf {s}_1, \mathbf {s}_2,..., \mathbf {s}_p\}\). For the polynomial \(y(\mathbf {s}) = \mathbf {p}(\mathbf {s})^\top \varvec{\beta }\) with \(\varvec{\beta } \sim \mathcal {N}_p(\mathbf {0},\varvec{\Sigma }_\beta )\), the first \(p\) equations of the system \(\{\mathbf {p}(\mathbf {s}_1)^\top \varvec{\beta } = y(\mathbf {s}_1) ,..., \mathbf {p}(\mathbf {s}_p)^\top \varvec{\beta } = y(\mathbf {s}_p) , \mathbf {p}(\mathbf {s})^\top \varvec{\beta } = y(\mathbf {s}) \}\) already determine \(\varvec{\beta }\), because the locations are distinct and hence the \(p \times p\) matrix with rows \(\mathbf {p}(\mathbf {s}_i)^\top \) is invertible. The last equation therefore adds no information, and the system is equivalent to \(\{\mathbf {p}(\mathbf {s}_1)^\top \varvec{\beta } = y(\mathbf {s}_1) ,..., \mathbf {p}(\mathbf {s}_p)^\top \varvec{\beta } = y(\mathbf {s}_p) \}\); thus, given \(y(\mathbf {s}_1), y(\mathbf {s}_2),...,y(\mathbf {s}_p)\), the value \(y(\mathbf {s})\) is determined with probability 1. Then, the exact distribution of \(\mathbf {y}=(y(\mathbf {s}_1), y(\mathbf {s}_2), ..., y(\mathbf {s}_n))\) can be written as

$$\begin{aligned} f(\mathbf {y})=&\prod _{i=1}^{n} f(y(\mathbf {s}_i)| y(\mathbf {s}_{h_i}))\\ =&f(y(\mathbf {s}_1)) f(y(\mathbf {s}_2)|y(\mathbf {s}_1)) f(y(\mathbf {s}_3)|y(\mathbf {s}_1),y(\mathbf {s}_2)) \cdots f(y(\mathbf {s}_p)|y(\mathbf {s}_1),y(\mathbf {s}_2), ...,y(\mathbf {s}_{p-1}) ) \cdot \\&\prod _{i=p+1}^{n} f(y(\mathbf {s}_i)| y(\mathbf {s}_1), y(\mathbf {s}_2),..., y(\mathbf {s}_p)), \end{aligned}$$

which equals \(\hat{f}(\mathbf {y})\) under the Vecchia approximation with the knot and conditioning set \(\{\mathbf {s}_1, \mathbf {s}_2, ...,\mathbf {s}_p\}\). Thus, the covariance is captured exactly. \(\square \)
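As a quick numerical check of the proposition (a self-contained sketch with an arbitrary quadratic basis and coefficient prior, not taken from the paper), conditioning on \(p = 3\) distinct knots leaves essentially zero conditional variance at any other location:

```python
import numpy as np

p_basis = lambda s: np.array([1.0, s, s**2])      # p = 3 polynomial basis (illustrative)
Sigma_beta = np.diag([1.0, 0.5, 0.25])            # prior covariance of the coefficients

knots = np.array([0.1, 0.4, 0.8])                 # any p distinct locations
s_new = 0.6                                       # an arbitrary further location

P_knots = np.stack([p_basis(s) for s in knots])   # p x p design matrix at the knots
C_kk = P_knots @ Sigma_beta @ P_knots.T           # C(S, S) among the knots
c_nk = p_basis(s_new) @ Sigma_beta @ P_knots.T    # C(s_new, S)
c_nn = p_basis(s_new) @ Sigma_beta @ p_basis(s_new)

cond_var = c_nn - c_nk @ np.linalg.solve(C_kk, c_nk)
print(cond_var)   # numerically zero: the p knots capture the polynomial covariance exactly
```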

Proof of Theorem 1

The following proof is related to Guinness (2018, Thm. 1). Suppose the true covariance is \( \Sigma _0\) and the approximated covariance is \(\hat{\Sigma }\). At each level \(\ell \), the KL divergence between the two normal distributions can be written as \( KL \left( f(\mathbf {y}^{(\ell )}) || \hat{f}(\mathbf {y}^{(\ell )}) \right) =\frac{1}{2} E \left( -(\mathbf {y}^{(\ell )})^\top \Sigma _0^{-1}\mathbf {y}^{(\ell )} \right) +\frac{1}{2} E \left( (\mathbf {y}^{(\ell )})^\top \hat{\Sigma }^{-1}\mathbf {y}^{(\ell )}\right) +\frac{1}{2} \log \frac{ |\hat{\Sigma } |}{|\Sigma _0 |}\), where the expectations are taken under the true distribution \(f\). Since \(\Sigma _0\) is the true covariance, the first term is \(E \left( -(\mathbf {y}^{(\ell )})^\top \Sigma _0^{-1}\mathbf {y}^{(\ell )} \right) = -n\). Based on the MSV construction, \(\log |\hat{\Sigma } |=\sum _{i=1}^{n} \log D_i^{(\ell )}\). Let \(L_0\) be the Cholesky factor of \(\Sigma _0\). Then \(E \left( (\mathbf {y}^{(\ell )})^\top \hat{\Sigma }^{-1}\mathbf {y}^{(\ell )}\right) =tr(\mathbf {U}\mathbf {U}^\top \Sigma _0)= \sum _{i,j} (L_0^\top \mathbf {U})_{ij}^2 =n\), because the \(i\)th column \(\mathbf {u}_i\) of \(\mathbf {U}\) satisfies \(\sum _j (L_0^\top \mathbf {U})_{ji}^2 = \mathbf {u}_i^\top \Sigma _0 \mathbf {u}_i = var\left( (\mathbf {U}^\top \mathbf {y}^{(\ell )})_i \right) = 1\); the \(i\)th entry of \(\mathbf {U}^\top \mathbf {y}^{(\ell )}\) is the conditional residual of \(y_i^{(\ell )}\) scaled by \((D_i^{(\ell )})^{-1/2}\), which has unit variance under the true covariance. Thus, the KL divergence can be written as \(KL \left( f(\mathbf {y}^{(\ell )}) || \hat{f}(\mathbf {y}^{(\ell )}) \right) =\frac{1}{2} \left( -n+n+ \sum _{i=1}^{n} \log D_i^{(\ell )}-\log |\Sigma _0 |\right) =\frac{1}{2} \sum _{i=1}^{n} \log D_i^{(\ell )}-\frac{1}{2}\log |\Sigma _0 |\), where the last term does not depend on the approximation. \(\square \)
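The identities used in the proof are easy to verify numerically. The following sketch (an illustrative one-dimensional example, not taken from the paper) builds a Vecchia factor \(\mathbf {U}\) from a true covariance \(\Sigma _0\) and confirms that \(tr(\mathbf {U}\mathbf {U}^\top \Sigma _0) = n\) and that the KL divergence reduces to \(\frac{1}{2}\sum _i \log D_i\) plus a constant:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 30, 3                                             # m = conditioning-set size (illustrative)
locs = np.sort(rng.uniform(size=n))
Sigma0 = np.exp(-np.abs(locs[:, None] - locs[None, :]) / 0.3)   # true (exponential) covariance

U = np.zeros((n, n))
logD = np.zeros(n)
for i in range(n):
    cond = list(range(max(0, i - m), i))                 # condition on the m previous variables
    if cond:
        B = np.linalg.solve(Sigma0[np.ix_(cond, cond)], Sigma0[cond, i])
        D = Sigma0[i, i] - Sigma0[i, cond] @ B
        U[cond, i] = -B / np.sqrt(D)                     # off-diagonal entries of column i
    else:
        D = Sigma0[i, i]
    U[i, i] = D ** -0.5                                  # diagonal entry
    logD[i] = np.log(D)

print(np.trace(U @ U.T @ Sigma0))                        # equals n, the key identity in the proof
kl_direct = 0.5 * (np.trace(U @ U.T @ Sigma0) - n + logD.sum() - np.linalg.slogdet(Sigma0)[1])
kl_theorem = 0.5 * (logD.sum() - np.linalg.slogdet(Sigma0)[1])
print(kl_direct, kl_theorem)                             # identical, as in Theorem 1
```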

Cite this article

Zhang, J., Katzfuss, M. Multi-Scale Vecchia Approximations of Gaussian Processes. JABES 27, 440–460 (2022). https://doi.org/10.1007/s13253-022-00488-0
