Abstract
Kriging is a widely employed technique across computer experiments, machine learning and geostatistics. An important challenge for kriging is its high computational cost when dealing with large datasets. This article focuses on a class of methods that aim to decrease this computational burden by aggregating kriging predictors based on smaller data subsets. More precisely, it shows that aggregation methods that ignore the covariance between sub-models typically yield inconsistent predictions, whereas the nested kriging method enjoys several attractive properties: it is consistent, it can be interpreted as an exact conditional distribution for a modified prior, and the conditional covariances given the observations can be computed efficiently. This article also includes a theoretical and numerical analysis of how the assignment of the observation points to the sub-models can affect the prediction ability of the aggregated model. Finally, the nested kriging method is extended to measurement errors and to universal kriging.
References
Abrahamsen P (1997) A review of Gaussian random fields and correlation functions. Technical report, Norwegian Computing Center
Allard D, Comunian A, Renard P (2012) Probability aggregation methods in geoscience. Math Geosci 44(5):545–581
Bacchi V, Jomard H, Scotti O, Antoshchenkova E, Bardet L, Duluc CM, Hebert H (2020) Using meta-models for tsunami hazard analysis: an example of application for the French Atlantic coast. Front Earth Sci 8(41):1–17
Bachoc F (2013) Cross validation and maximum likelihood estimations of hyper-parameters of Gaussian processes with model misspecification. Comput Stat Data Anal 66:55–69
Bachoc F, Ammar K, Martinez JM (2016) Improvement of code behavior in a design of experiments by metamodeling. Nucl Sci Eng 183(3):387–406
Bachoc F, Lagnoux A, Nguyen TMN (2017) Cross-validation estimation of covariance parameters under fixed-domain asymptotics. J Multivar Anal 160:42–67
Banerjee S, Gelfand AE, Finley AO, Sang H (2008) Gaussian predictive process models for large spatial data sets. J R Stat Soc Ser B (Stat Methodol) 70(4):825–848
Cao Y, Fleet DJ (2014) Generalized product of experts for automatic and principled fusion of Gaussian process predictions. In: Modern Nonparametrics 3: Automating the Learning Pipeline workshop at NIPS, Montreal, pp 1–5
Chevalier C, Ginsbourger D (2013) Fast computation of the multi-points expected improvement with applications in batch selection. In: Learning and intelligent optimization. Springer, Berlin, pp 59–69
Chilès JP, Delfiner P (2012) Geostatistics: modeling spatial uncertainty, vol 713. Wiley, New York
Chilès JP, Desassis N (2018) Fifty years of Kriging. Handbook of mathematical geosciences. Springer, Cham, pp 589–612
Cressie N (1990) The origins of Kriging. Math Geol 22(3):239–252
Cressie N (1993) Statistics for spatial data. Wiley, New York
Cressie N, Johannesson G (2008) Fixed rank Kriging for very large spatial data sets. J R Stat Soc Ser B (Stat Methodol) 70(1):209–226
Datta A, Banerjee S, Finley AO, Gelfand AE (2016) Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. J Am Stat Assoc 111(514):800–812
Davis BJK, Curriero FC (2019) Development and evaluation of geostatistical methods for non-Euclidean-based spatial covariance matrices. Math Geosci 51(6):767–791
Deisenroth MP, Ng JW (2015) Distributed Gaussian processes. In: Proceedings of the 32nd international conference on machine learning, Lille, France. JMLR: W&CP, vol 37
Finley AO, Sang H, Banerjee S, Gelfand AE (2009) Improving the performance of predictive process modeling for large datasets. Comput Stat Data Anal 53(8):2873–2884
Furrer R, Genton MG, Nychka D (2006) Covariance tapering for interpolation of large spatial datasets. J Comput Graph Stat 15(3):502–523
He J, Qi J, Ramamohanarao K (2019) Query-aware Bayesian committee machine for scalable Gaussian process regression. In: Proceedings of the 2019 SIAM international conference on data mining. SIAM, pp 208–216
Heaton MJ, Datta A, Finley AO, Furrer R, Guinness J, Guhaniyogi R, Gerber F, Gramacy RB, Hammerling D, Katzfuss M, Lindgren F, Nychka D, Sun F, Zammit-Mangion A (2019) A case study competition among methods for analyzing large spatial data. J Agric Biol Environ Stat 24(3):398–425
Hensman J, Fusi N, Lawrence ND (2013) Gaussian processes for big data. In: Uncertainty in artificial intelligence, pp 282–290
Hinton GE (2002) Training products of experts by minimizing contrastive divergence. Neural Comput 14(8):1771–1800
Jones DR, Schonlau M, Welch WJ (1998) Efficient global optimization of expensive black box functions. J Global Optim 13:455–492
Kaufman CG, Schervish MJ, Nychka DW (2008) Covariance tapering for likelihood-based estimation in large spatial data sets. J Am Stat Assoc 103(484):1545–1555
Krige DG (1951) A statistical approach to some basic mine valuation problems on the Witwatersrand. J South Afr Inst Min Metall 52(6):119–139
Krityakierne T, Baowan D (2020) Aggregated GP-based optimization for contaminant source localization. Oper Res Perspect 7:100151
Liu H, Cai J, Wang Y, Ong YS (2018) Generalized robust Bayesian committee machine for large-scale Gaussian process regression. In: International conference on machine learning 2018, Proceedings of Machine Learning Research, vol 80, pp 3131–3140
Liu H, Ong Y, Shen X, Cai J (2020) When Gaussian process meets big data: a review of scalable GPs. IEEE Trans Neural Netw Learn Syst 31:4405–4423
Marrel A, Iooss B, Laurent B, Roustant O (2009) Calculations of Sobol indices for the Gaussian process metamodel. Reliab Eng Syst Saf 94(3):742–751
Matheron G (1970) La Théorie des Variables Régionalisées et ses Applications. Fascicule 5 in Les Cahiers du Centre de Morphologie Mathématique de Fontainebleau, Ecole Nationale Supérieure des Mines de Paris
Putter H, Young A (2001) On the effect of covariance function estimation on the accuracy of Kriging predictors. Bernoulli 7(3):421–438
Quinonero-Candela J, Rasmussen CE (2005) A unifying view of sparse approximate Gaussian process regression. J Mach Learn Res 6:1939–1959
Rasmussen CE, Williams CKI (2006) Gaussian processes for machine learning. MIT Press, Cambridge
Roustant O, Ginsbourger D, Deville Y (2012) DiceKriging, DiceOptim: Two R packages for the analysis of computer experiments by Kriging-based metamodeling and optimization. J Stat Softw 51(1):1–55
Rue H, Held L (2005) Gaussian Markov random fields: theory and applications. Chapman & Hall, Boca Raton
Rullière D, Durrande N, Bachoc F, Chevalier C (2018) Nested Kriging predictions for datasets with a large number of observations. Stat Comput 28(4):849–867
Sacks J, Welch WJ, Mitchell TJ, Wynn HP (1989) Design and analysis of computer experiments. Stat Sci 4:409–423
Santner TJ, Williams BJ, Notz WI (2013) The design and analysis of computer experiments. Springer, Berlin
Stein ML (2012) Interpolation of spatial data: some theory for Kriging. Springer, Berlin
Stein ML (2014) Limitations on low rank approximations for covariance matrices of spatial data. Spatial Stat 8:1–19
Sun X, Luo XS, Xu J, Zhao Z, Chen Y, Wu L, Chen Q, Zhang D (2019) Spatio-temporal variations and factors of a provincial PM2.5 pollution in eastern China during 2013–2017 by geostatistics. Sci Rep 9(1):1–10
Tresp V (2000) A Bayesian committee machine. Neural Comput 12(11):2719–2741
van Stein B, Wang H, Kowalczyk W, Bäck T, Emmerich M (2015) Optimally weighted cluster Kriging for big data regression. In: International symposium on intelligent data analysis. Springer, pp 310–321
van Stein B, Wang H, Kowalczyk W, Emmerich M, Bäck T (2020) Cluster-based Kriging approximation algorithms for complexity reduction. Appl Intell 50(3):778–791
Vazquez E, Bect J (2010a) Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. J Stat Plann Inference 140(11):3088–3095
Vazquez E, Bect J (2010b) Pointwise consistency of the kriging predictor with known mean and covariance functions. In: Giovagnoli A, Atkinson AC, Torsney B, May C (eds) mODa 9—Advances in model-oriented design and analysis. Physica-Verlag HD, Heidelberg, pp 221–228. ISBN 978-3-7908-2410-0
Ying Z (1991) Asymptotic properties of a maximum likelihood estimator with data from a Gaussian process. J Multivar Anal 36:280–296
Zhang H, Wang Y (2010) Kriging and cross validation for massive spatial data. Environmetrics 21:290–304
Zhu Z, Zhang H (2006) Spatial sampling design under the infill asymptotic framework. Environmetrics 17(4):323–337
Funding
Part of this research was conducted within the frame of the Chair in Applied Mathematics OQUAIDO, which gathers partners in technological research (BRGM, CEA, IFPEN, IRSN, Safran, Storengy) and academia (Ecole Centrale de Lyon, Mines Saint-Étienne, University of Nice, University of Toulouse and CNRS) around advanced methods for computer experiments. The authors F. Bachoc and D. Rullière acknowledge support from the regional MATH-AmSud program, Grant Number 20-MATH-03. The authors are grateful to the Editor-in-Chief, an Associate Editor and all referees for their constructive suggestions that led to an improvement in the manuscript.
Appendices
Appendix 1. Proof of Proposition 1
For \(v \in \mathbb {R}^m\), we let \(|v| = \max _{i=1,\ldots ,m} |v_i|\) and \(B(v,r) = \{w \in \mathbb {R}^m,|v-w| \le r\}\).
Let \(x_0,{\bar{x}} \in D\), \(r_{x_0}>0\) and \(r_{{\bar{x}}}>0\) be fixed and satisfy \(B(x_0,r_{x_0}) \subset D\), \(B({\bar{x}},r_{{\bar{x}}}) \subset D\), \(B(x_0,r_{x_0}) \cap B({\bar{x}},r_{{\bar{x}}}) = \varnothing \) and \(k(x_0,{\bar{x}}) >0\). [The existence is implied by the assumptions of the proposition.] By continuity of k, \(r_{x_0} >0\) and \(r_{{\bar{x}}} >0\) can be selected small enough so that, with some fixed \(\epsilon _2 >0\) and \(\delta _1 >0\), for \(v \in B( x_0 , r_{x_0} )\) and \(w \in B( {\bar{x}} , r_{{\bar{x}}} )\), \(| v - w | \ge \delta _1\), \( k(x_0,x_0) /2 \le k(v,v) \le 2k(x_0,x_0)\), \( k({\bar{x}},{\bar{x}}) /2 \le k(w,w) \le 2k({\bar{x}},{\bar{x}})\) and
For \(\delta >0\), let
Then \(V(\delta )>0\) because of the NEB, by continuity of k and by compactness.
Consider a decreasing sequence \(\delta _n\) of non-negative numbers such that \(\delta _n \rightarrow _{n \rightarrow \infty } 0\), and which will be specified below. There exists a sequence \((u_n)_{n \in \mathbb {N}} \in D^{\mathbb {N}}\), composed of pairwise distinct elements, such that \(\lim _{n \rightarrow \infty } \sup _{x \in D}\min _{i=1,\ldots ,n} | u_{i} - x | = 0\), and such that for all n
Such a sequence indeed exists from Lemma 2 below.
Consider then a sequence \((w_n)_{n \in \mathbb {N}} \in D^{\mathbb {N}}\) such that for all n, \(w_n = {\bar{x}} -(r_{{\bar{x}}}/(1+n)) e_1\) with \(e_1=(1,0,\ldots ,0)\). We can assume furthermore that \(\{u_n \}_{n \in \mathbb {N}}\) and \(\{w_n \}_{n \in \mathbb {N}}\) are disjoint (this almost surely holds with the construction of Lemma 2 for \((u_n)\)).
Let us now consider two sequences of integers \(p_n\) and \(k_n\) with \(k_n \rightarrow \infty \) and \(p_n \rightarrow \infty \) to be specified later. Let \(C_n\) be the largest natural number m satisfying \(m (p_n-1) < n\). Let \(X = (X_1,\ldots ,X_{p_n})\) be defined by, for \(i=1,\ldots ,k_n\), \(X_i = ( u_j)_{ j=(i-1)C_n + 1,\ldots ,i C_n }\); for \(i=k_n+1,\ldots ,p_n-1\), \(X_i = ( w_j)_{j=(i-k_n-1) C_n + 1,\ldots ,(i - k_n) C_n }\); and \(X_{p_n} = ( w_j)_{j=(p_n-k_n-1) C_n + 1,\ldots ,n- k_n C_n }\). With this construction, note that \(X_{p_n}\) is nonempty. Furthermore, the sequence of vectors \(X = (X_{1},\ldots ,X_{p_n})\), indexed by \(n \in \mathbb {N}\), defines a triangular array of observation points satisfying the conditions of the proposition.
Let us discuss the construction of \((u_n)_{n \in \mathbb {N}}\), \((w_n)_{n \in \mathbb {N}}\), \(k_n\), \(C_n\) and \(p_n\) more informally. The sequence \((u_n)_{n \in \mathbb {N}}\) is dense in D, and \(X_1,\ldots ,X_{k_n}\) are composed of the \(k_n C_n\) first points of this sequence. Then, \(X_{k_n+1},\ldots ,X_{p_n}\) are composed of the \(n - C_n k_n\) first points of the sequence \((w_n)_{n \in \mathbb {N}}\), which is concentrated around \({\bar{x}}\). We will let \(k_n / p_n \rightarrow 0\) so that the majority of the groups in X contain points of \((w_n)_{n \in \mathbb {N}}\), so that they do not contain relevant information on the values of Y on \(B(x_0,r_{x_0})\) and yield an inconsistency of the aggregated predictor \(M_{{\mathcal {A}},n}\) on \(B(x_0,r_{x_0})\).
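To make this bookkeeping concrete, the following Python sketch (an illustration of ours, not part of the original proof; the function name and the handling of the last group are simplifications) builds the groups \(X_1,\ldots ,X_{p_n}\) from the first points of the sequences \((u_i)\) and \((w_i)\), with the choices \(p_n = n^{4/5}\) and \(k_n = n^{1/5}\) made later in the proof. It assumes that u and w contain enough points.

```python
import numpy as np

def proposition1_groups(u, w, n):
    """Sketch (ours) of the grouping used in the proof: the first k_n groups
    collect the first k_n * C_n points of the dense sequence u, the remaining
    p_n - k_n groups collect the first n - k_n * C_n points of the sequence w
    concentrated around x_bar.  Assumes u and w are long enough."""
    p_n = max(int(n ** 0.8), 2)                 # p_n = n^{4/5}
    k_n = max(int(n ** 0.2), 1)                 # k_n = n^{1/5}
    C_n = (n - 1) // (p_n - 1)                  # largest m with m * (p_n - 1) < n
    groups = [np.asarray(u[i * C_n:(i + 1) * C_n]) for i in range(k_n)]
    m = n - k_n * C_n                           # number of points taken from w
    groups += [np.asarray(w[j * C_n:min((j + 1) * C_n, m)]) for j in range(p_n - k_n)]
    return groups                               # p_n groups, the last one nonempty
```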
Coming back to the proof, observe that \(\inf _{i \in \mathbb {N}} \inf _{x \in B(x_0,r_{x_0})} |w_i - x| \ge \delta _1\) and let \(\epsilon _1 = V(\delta _1) >0\). Then, we have for all \(n \in \mathbb {N}\), for all \(x \in B(x_0,r_{x_0})\), and for all \(k=k_n+1,\ldots ,p_n\), since then \(X_k\) is nonempty and only contains elements \(w_i \in B({\bar{x}},r_{{\bar{x}}})\), from Eq. (28)
Let \(\mathcal {E}_n = \{x \in B(x_0,r_{x_0}) ; \min _{i=1,\ldots ,n} | x - u_i | \ge \delta _n \}\) and let \(x \in \mathcal {E}_n\). Since x is not a component of X, we have \(v_k(x) >0\) for all k. Also, \(v_{p_n}(x) < k(x,x)\) from Eq. (29). Hence, \(M_{{\mathcal {A}},n}(x)\) is well-defined.
For two random variables A and B, we let \(||A-B|| = ({{\,\mathrm{\mathrm {E}}\,}}\left[ (A-B)^2\right] )^{1/2}\). For \(x \in \mathcal {E}_n\) let
Then, from the triangle inequality, and since, from the law of total variance, \(|| M_k(x) || \le || Y( x ) || = v_\mathrm{prior}(x)^{1/2}\), we have, with \(\mathcal {V} = \{ k(x,x) ; x \in B(x_0,r_{x_0}) \} \)
where the last inequality is obtained from Eq. (29) and the definition of \(\delta _n\) and \(V(\delta )\).
Now, for \(\delta >0\), let \(s(\delta ) = \sup _{v \in \mathcal {V},V(\delta ) \le s^2 \le v } a( s^2 , v )\). Since a is continuous and since \(V(\delta ) >0\), we have that \(s(\delta )\) is finite. Hence, we can choose a sequence \(\delta _n\) of positive numbers such that \(\delta _n \rightarrow _{n \rightarrow \infty } 0\) and \(s(\delta _n) \le \sqrt{n}\) (for instance, let \(\delta _n = \inf \{ \delta \ge n^{-1/2}; s(\delta ) \le n^{1/2} \}\)). Then, we can choose \(p_n = n^{4/5}\) and \(k_n = n^{1/5}\). Then, for large enough n,
Since
is a finite constant, as b is positive and continuous on \(\mathring{\Delta }\), we have that \( \sup _{x \in \mathcal {E}_n} R(x) \rightarrow _{n \rightarrow \infty } 0\). As a consequence, we have from the triangle inequality, for \(x \in \mathcal {E}_n\)
Since \(X_{k_n+1},\ldots ,X_{p_n}\) are composed only of elements of \(\{ w_i \}_{i \in \mathbb {N}}\), we obtain
As a result there exist fixed \(n_0 \in \mathbb {N}\) and \(A >0\) so that for \(n \ge n_0\), \(|| Y(x) - M_{{\mathcal {A}},n}(x) || \ge A\). We thus have, for \(n \ge n_0\)
It remains to be shown that the limit inferior of the volume of \(\mathcal {E}_n\) is not zero in order to show Eq. (4). Let \(N_n\) be the integer part of \(r_{x_0} / 4 \delta _n\). Then, the ball \(B(x_0,r_{x_0})\) contains \((2 N_n)^d\) disjoint balls of the form \(B(a,4 \delta _n)\) with \(a \in B(x_0,r_{x_0})\). If one of these balls \(B(a,4 \delta _n)\) does not intersect with \((u_i)_{i=1\ldots ,n}\), then we can associate to it a ball of the form \(B(s_a,\delta _n) \subset B(a,4 \delta _n) \cap \mathcal {E}_n\). If one of these balls \(B(a,4 \delta _n)\) does intersect with one \(u_j \in \{u_i\}_{i=1\ldots ,n}\), then we can find a ball \(B(s_a, \delta _n/2 ) \subset ( B(u_j, 2 \delta _n) \backslash B(u_j, \delta _n) ) \cap B(a,4 \delta _n) \cap \mathcal {E}_n\). Hence, we have found \((2 N_n)^d\) disjoint balls with radius \(\delta _n/2\) in \(\mathcal {E}_n\). Therefore, \(\mathcal {E}_n\) has volume at least \(2^d ((r_{x_0} / 4 \delta _n) - 1)^d \delta _n^{d} \) which has a strictly positive limit inferior. Hence, Eq. (4) is proved.
Finally, if \( {{\,\mathrm{\mathrm {E}}\,}}\left[ \left( Y(x_0) - M_{{\mathcal {A}},n}(x_0)\right) ^2\right] \rightarrow 0\) as \(n \rightarrow \infty \) for almost all \(x_0 \in D\), then
from the dominated convergence theorem. This is in contradiction to the proof of Eq. (4). Hence, Eq. (5) is proved.
Lemma 2
There exists a sequence \((u_n)_{n \in \mathbb {N}} \in D^{\mathbb {N}}\), composed of pairwise distinct elements, such that
and such that for all n
Proof
Such a sequence can be constructed, for instance, by the following random procedure. Let \(D \subset B(0,R)\) for large enough \(R>0\). Define \(u_1 \in D\) arbitrarily. For \(n=1,2,\ldots \), (1) if the set \( \mathcal {S}_n = \{u \in B(x_0,r_{x_0} ) ; \min _{i=1,\ldots ,n} |u-u_i| > 4 \delta _{n+1} \}\) is nonempty, sample \(u_{n+1}\) from the uniform distribution on \(\mathcal {S}_n\). (2) If \(\mathcal {S}_n\) is empty, sample \({\tilde{u}}_{n+1}\) from the uniform distribution on \(B(0,R) \backslash B(x_0,r_{x_0} )\), and set \(u_{n+1}\) as the projection of \({\tilde{u}}_{n+1}\) on \(D \backslash B(x_0,r_{x_0} )\). One can see that Eq. (31) is satisfied by definition. Furthermore, one can show that Eq. (30) almost surely holds. Indeed, let \(x \in B(x_0,r_{x_0} )\) and \(\epsilon >0\), and assume that with nonzero probability \(B(x,\epsilon ) \cap \{ u_i\}_{i \in \mathbb {N}} = \varnothing \). Then, case (1) occurs infinitely often, and for each i for which case (1) occurs, there is a probability at least \( \epsilon ^d / (2 r_{x_0})^d \) that \(u_i \in B(x,\epsilon )\) (when \(4 \delta _n \le \epsilon / 2\)). This yields a contradiction. Hence, for all \(x \in B(x_0,r_{x_0} )\) and \(\epsilon >0\), almost surely, \(B(x,\epsilon ) \cap \{ u_i\}_{i \in \mathbb {N}} \ne \varnothing \). We similarly show that for all \(x \in D \backslash B(x_0,r_{x_0} )\) and \(\epsilon >0\), almost surely, \(B(x,\epsilon ) \cap \{ u_i\}_{i \in \mathbb {N}} \ne \varnothing \). This shows that Eq. (30) almost surely holds. Hence, a fortiori, there exists a sequence \((u_n)_{n \in \mathbb {N}} \in D^{\mathbb {N}}\) satisfying the conditions of the lemma. \(\square \)
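The random procedure above can be mimicked numerically. The sketch below (ours, with hypothetical names, written for the special case \(D=[0,1]^d\) with the sup norm and \(r_{x_0} < 1/2\); rejection sampling stands in for an exact test of whether \(\mathcal {S}_n\) is empty) illustrates the construction; a call such as dense_sequence(200, np.full(2, 0.5), 0.2) produces a sequence that fills the ball around \(x_0\) first and the rest of D afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_sequence(n_points, x0, r_x0, d=2, delta=lambda n: 1.0 / np.sqrt(n + 1)):
    """Sketch (ours) of the random construction of Lemma 2, for D = [0, 1]^d
    with the sup norm; delta plays the role of the sequence delta_n."""
    u = [rng.uniform(0.0, 1.0, size=d)]                   # u_1 arbitrary in D
    for n in range(1, n_points):
        found = False
        for _ in range(200):                              # try to sample in S_n
            cand = np.clip(x0 + rng.uniform(-r_x0, r_x0, size=d), 0.0, 1.0)
            if min(np.max(np.abs(cand - ui)) for ui in u) > 4 * delta(n + 1):
                u.append(cand)
                found = True
                break
        if not found:                                     # S_n (numerically) empty
            while True:                                   # sample in D outside B(x0, r_x0)
                cand = rng.uniform(0.0, 1.0, size=d)
                if np.max(np.abs(cand - x0)) > r_x0:
                    u.append(cand)
                    break
    return np.array(u)
```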
Remark 6
Consider the case \(d=1\). The proof of Proposition 1 can be modified so that the partition \(X_1,\ldots ,X_{p_n}\) also satisfies \(x \le x'\) for any \(x \in X_i\), \(x' \in X_j\), \(1 \le i < j \le p_n\). To see this, consider the same X as in this proof. Let \(X_1,\ldots ,X_{p_n}\) have the same cardinality as in this proof, and let the \(C_n\) smallest elements of X be associated to \(X_1\), the next \(C_n\) smallest be associated to \(X_2\) and so on. Then, one can show that there are at most \(k_n+2\) groups containing elements of \((u_i)_{i \in \mathbb {N}} \cap B(x_0 , r_{x_0}) \) and at least \(p_n - k_n -2\) groups containing only elements of \(B({\bar{x}},r_{{\bar{x}}})\). From these observations, Eqs. (4) and (5) can be proved similarly as in the proof of Proposition 1.
Appendix 2. Proof of Proposition 2
Because D is compact, we have \(\lim _{n \rightarrow \infty } \sup _{x \in D} \min _{i=1,\ldots ,n} || x_{ni} - x || = 0\). Indeed, if this does not hold, there exists \(\epsilon >0\) and a subsequence \(\phi (n)\) such that \(\sup _{x \in D} \min _{i=1,\ldots ,\phi (n)} || x_{\phi (n)i} - x || \ge 2 \epsilon \). Hence, there exists a sequence \(x_{\phi (n)} \in D\) such that \(\min _{i=1,\ldots ,\phi (n)} || x_{\phi (n)i} - x_{\phi (n)} || \ge \epsilon \). Since D is compact, up to extracting a further subsequence, we can also assume that \(x_{\phi (n)} \rightarrow _{n \rightarrow \infty } x_{lim}\) with \(x_{lim} \in D\). This implies that for all large enough n, \(\min _{i=1,\ldots ,\phi (n)} || x_{\phi (n)i} - x_{lim} || \ge \epsilon / 2\), which is in contradiction to the assumptions of the proposition.
Hence there exists a sequence of positive numbers \(\delta _n\) such that \(\delta _n \rightarrow _{n \rightarrow \infty } 0\) and such that for all \(x \in D\) there exists a sequence of indices \(i_n(x)\) such that \(i_n(x) \in \{1,\ldots ,n\}\) and \(||x - x_{n i_n(x)}|| \le \delta _n\). There also exists a sequence of indices \(j_n(x)\) such that \(x_{ni_n(x)}\) is a component of \(X_{j_n(x)}\). With these notations we have, since \(M_1(x)\),..., \(M_{p_n}(x)\), \(M_{{\mathcal {A}}}(x)\) are linear combinations with minimal square prediction errors
In the rest of the proof we essentially show that, for a dense triangular array of observation points, the kriging predictor that predicts Y(x) based only on the nearest neighbor of x among the observation points has a mean square prediction error that tends to zero uniformly in x when k is continuous. We believe that this fact is known, but we have not been able to find a precise statement of it in the literature. From Eq. (32) we have
Assume now that the above supremum does not go to zero as \(n \rightarrow \infty \). Then there exists \(\epsilon >0\) and two sub-sequences \(x_{\phi (n)}\) and \(t_{\phi (n)}\) with values in D such that \(x_{\phi (n)} \rightarrow _{n \rightarrow \infty } x_{lim}\) and \(t_{\phi (n)} \rightarrow _{n \rightarrow \infty } x_{lim}\), with \(x_{lim} \in D\) and such that \(F(x_{\phi (n)},t_{\phi (n)}) \ge \epsilon \). If \(k(x_{lim},x_{lim}) = 0\) then \(F(x_{\phi (n)},t_{\phi (n)}) \le k(x_{\phi (n)},x_{\phi (n)}) \rightarrow _{n \rightarrow \infty } 0\). If \(k(x_{lim},x_{lim}) > 0\), then for large enough n,
which tends to zero as \(n \rightarrow \infty \) since k is continuous. Hence we have a contradiction, which completes the proof.
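The fact used in this proof can also be checked numerically. In the sketch below (an illustration of ours, assuming a squared-exponential covariance on [0, 1]), the quantity computed is \(k(x,x) - k(x,t)^2/k(t,t)\), the mean square error of the kriging predictor of Y(x) based on the single observation Y(t); its supremum over x, with t the nearest observation point of x, decreases towards zero as the design becomes denser.

```python
import numpy as np

def k(a, b, ell=0.3):
    """Squared-exponential covariance on [0, 1] (an arbitrary illustrative choice)."""
    return np.exp(-0.5 * (a - b) ** 2 / ell ** 2)

def sup_nn_mse(n_obs, n_test=2000):
    """Supremum over x of k(x,x) - k(x,t)^2 / k(t,t), the mean square error of
    the kriging predictor of Y(x) based on its nearest observation t."""
    obs = np.linspace(0.0, 1.0, n_obs)
    xs = np.linspace(0.0, 1.0, n_test)
    t = obs[np.argmin(np.abs(xs[:, None] - obs[None, :]), axis=1)]  # nearest observation of each x
    return np.max(k(xs, xs) - k(xs, t) ** 2 / k(t, t))

for n in (5, 20, 80, 320):
    print(n, sup_nn_mse(n))   # decreases towards 0 as the design becomes denser
```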
Appendix 3. Proofs from Sect. 3.2
First notice that denoting \(k_{\mathcal {A}}(x,x') = {{\,\mathrm{\mathrm {Cov}}\,}}\left[ Y_{\mathcal {A}}(x), Y_{\mathcal {A}}(x')\right] \), we easily get for all \(x, x' \in D\)
A direct consequence of Eq. (33) is \(k_{\mathcal {A}}(x,x) = k(x,x)\), and under the interpolation assumption H2, since \(Y_{\mathcal {A}}(X) = Y(X)\), \(k_{\mathcal {A}}(X,X) = k(X,X)\).
Proof of Proposition 3
The interpolation hypothesis \(M_{\mathcal {A}}(X) = Y(X)\) ensures that \(\varepsilon '_{\mathcal {A}}(X)=0\), so we have
The proof that \(v_{\mathcal {A}}\) is a conditional variance follows the same pattern
\(\square \)
Proof of Proposition 4
Equation (11) is the classical expression of Gaussian conditional covariances, based on the fact that \(Y_{\mathcal {A}}\) is Gaussian. Let us now prove Eq. (12). For a component \(x_k\) of the vector of points X, using the interpolation assumption, we have \(M_{\mathcal {A}}(x_k) = Y(x_k)\) and
Note that \(\alpha _{\mathcal {A}}(x)\) is the \(p \times 1\) vector of aggregation weights of different sub-models at point x, so that \(M_{\mathcal {A}}(x)= {\alpha _{\mathcal {A}}(x)}^tM(x)\) and \( k_{\mathcal {A}}(x,x_k) = {\alpha _{\mathcal {A}}(x)}^t {{\,\mathrm{\mathrm {Cov}}\,}}\left[ M(x), Y(x_k)\right] \). We thus get
Under the linearity assumption, there exists a \(p \times n\) deterministic matrix \(\Lambda (x)\) such that \(M(x)=\Lambda (x) Y(X)\). Thus \(k_{\mathcal {A}}(x,X) = {\alpha _{\mathcal {A}}(x)}^t \Lambda (x) k(X,X)\). As noted in Sect. 3, because of the interpolation condition, \(k_{\mathcal {A}}(X,X)=k(X,X)\) and
Using \(K_M(x,x')={{\,\mathrm{\mathrm {Cov}}\,}}\left[ M(x),M(x')\right] = \Lambda (x) k(X,X) {\Lambda (x')}^t\), we get
Lastly, starting from Eq. (11) and using both Eqs. (33) and (38), we get Eq. (12).
Finally, the development of \({{\,\mathrm{\mathrm {E}}\,}}\left[ \left( Y(x)-M_{\mathcal {A}}(x)\right) \left( Y(x')-M_{\mathcal {A}}(x')\right) \right] \) leads to the right-hand side of Eq. (12), so that
and Eq. (13) holds. \(\square \)
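The quantities manipulated in this proof can be assembled explicitly. The following sketch (ours; it assumes scalar inputs, a covariance function k that broadcasts over NumPy arrays, sub-models that are simple kriging predictors, and the nested kriging aggregation of Rullière et al. 2018) computes \(\Lambda (x)\), \(M(x)\), \(K_M(x,x)\), \({{\,\mathrm{\mathrm {Cov}}\,}}\left[ M(x), Y(x)\right] \), the aggregation weights \(\alpha _{\mathcal {A}}(x)\) and the aggregated mean and variance at a single point x.

```python
import numpy as np

def nested_kriging_point(x, X, Y_obs, groups, k):
    """Sketch (ours) of the aggregation at a single point x, with the notation
    of the appendix: M(x) = Lambda(x) Y(X), K_M = Lambda k(X,X) Lambda^t and
    k_M = Lambda k(X,x).  X is a 1-D array of inputs, Y_obs the observed values,
    `groups` a list of index arrays partitioning {0,...,n-1}."""
    K = k(X[:, None], X[None, :])                 # k(X, X)
    kx = k(X, x)                                  # k(X, x)
    Lam = np.zeros((len(groups), len(X)))         # Lambda(x): simple kriging weights per sub-model
    for i, idx in enumerate(groups):
        Lam[i, idx] = np.linalg.solve(K[np.ix_(idx, idx)], kx[idx])
    M = Lam @ Y_obs                               # sub-model predictions M(x)
    K_M = Lam @ K @ Lam.T                         # Cov[M(x), M(x)]
    k_M = Lam @ kx                                # Cov[M(x), Y(x)]
    alpha = np.linalg.solve(K_M, k_M)             # aggregation weights alpha_A(x)
    return alpha @ M, k(x, x) - alpha @ k_M       # aggregated mean and variance
```

With, for instance, k = lambda a, b: np.exp(-np.abs(a - b)), X = np.linspace(0, 1, 12), Y_obs a sample of Y(X) and groups = np.array_split(np.arange(12), 3), the function returns the aggregated mean and variance at x.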
Appendix 4. Proofs from Sect. 3.3
Proof of Proposition 5
Consider \(\Delta (x)\) as defined in Eq. (14). From Eq. (36), using both the linear and the interpolation assumptions, we get \(k(x,X)\Delta (x) = \left[ k(x,X) - k_{\mathcal {A}}(x,X) \right] k(X,X)^{-1}\). Inserting this result into Eq. (14), we have
and the first equality holds. From Eq. (14), we also get \(v_{\mathcal {A}}(x)-v_{\mathrm{full}}(x)=k(x,X)k(X,X)^{-1}k(X,x)-k_{\mathcal {A}}(x,X)k(X,X)^{-1}k_{\mathcal {A}}(X,x)\), and the second equality holds. Note that under the same assumptions, we can also use \(k_{\mathcal {A}}(X,X)=k(X,X)\) and \(k_{\mathcal {A}}(x,x)=k(x,x)\) and start from \(M_{\mathcal {A}}=k_{\mathcal {A}}(x,X)k_{\mathcal {A}}(X,X)^{-1}Y(X)\) and \(v_{\mathcal {A}}(x)= k_{\mathcal {A}}(x,x)-k_{\mathcal {A}}(x,X)k_{\mathcal {A}}(X,X)^{-1}k_{\mathcal {A}}(X,x)\) to get the same results.
Let us now show Eq. (17). The upper bound comes from the fact that \(M_{\mathcal {A}}(x)\) is the best linear combination of \(M_k(x)\) for \(k \in {\lbrace {1,\ldots , p}\rbrace }\). The positivity of \(v_{\mathcal {A}}- v_{\mathrm{full}}\) can be proved similarly: \(M_{\mathcal {A}}(x)\) is a linear combination of \(Y(x_k)\), \(k \in {\lbrace {1, \ldots , n}\rbrace }\), whereas \(M_{\mathrm{full}}(x)\) is the best linear combination. Note that \(v_{\mathcal {A}}(x)-v_{\mathrm{full}}(x) \ge 0\) implies, using Eq. (15), that \({\Vert {}k_{\mathcal {A}}(X,x)\Vert }_K\le {\Vert {}k(X,x)\Vert }_K\). Let us now show Eq. (16). The result follows from Eq. (39) by applying the Cauchy–Schwarz inequality. The bound on \(v_{\mathcal {A}}(x)-v_{\mathrm{full}}(x)\) derives directly from Eq. (15), using \({\Vert {}k_{\mathcal {A}}(X,x)\Vert }_K\le {\Vert {}k(X,x)\Vert }_K\).
Finally, the classical inequality between \({\Vert {}.\Vert }_K\) and \({\Vert {}.\Vert }\) derives from the diagonalization of k(X, X). One can observe that it depends on n and X, but it does not depend on the prediction point x. \(\square \)
Proof of Remark 4
Using \({\Vert {}k_{\mathcal {A}}(X,x)\Vert }_K\le {\Vert {}k(X,x)\Vert }_K\), the equivalence of norms and the triangle inequality, and assuming that the smallest eigenvalue \(\lambda _{\min }\) of k(X, X) is nonzero, the bounds of Eq. (16) in Proposition 5 imply that
Noting that \({\Vert {}.\Vert }_K\) and \(\lambda _{\min }\) do not depend on x (although they depend on X and n), the result holds. \(\square \)
Proof of Remark 5
As \(\Lambda (x)\) is \(n \times n\) and invertible, we have
and similarly \(v_{{\mathcal {A}}}(x) = v_{{\mathrm{full}}}(x)\). As \(M_{\mathcal {A}}= M_{\mathrm{full}}\), we have \(Y_{\mathcal {A}}= M_{\mathrm{full}}+ \varepsilon \) where \(\varepsilon \) is an independent copy of \(Y- M_{\mathrm{full}}\). Furthermore \(Y= M_{\mathrm{full}}+ Y- M_{\mathrm{full}}\) where \(M_{\mathrm{full}}\) and \(Y- M_{\mathrm{full}}\) are independent, by Gaussianity, so \(Y_{\mathcal {A}}{\mathop {=}\limits ^{law}} Y\). \(\square \)
Appendix 5. Proofs from Sect. 4
Proof of Proposition 6
Let \((x,v_1,\ldots ,v_r)\) be \(r+1\) pairwise distinct real numbers. If \(x < \min (v_1,\ldots ,v_r)\), then the conditional expectation of Y(x) given \(Y(v_1),\ldots ,Y(v_r)\) is equal to \(\exp ( - | \min (v_1,\ldots ,v_r) - x | / \theta ) Y( \min (v_1,\ldots ,v_r) )\) (Ying 1991). Similarly, if \(x > \max (v_1,\ldots ,v_r)\), then the conditional expectation of Y(x) given \(Y(v_1),\ldots ,Y(v_r)\) is equal to \(\exp ( - | \max (v_1,\ldots ,v_r) - x | / \theta ) Y( \max (v_1,\ldots ,v_r) )\). If \(\min (v_1,\ldots ,v_r)< x < \max (v_1,\ldots ,v_r)\), then the conditional expectation of Y(x) given \(Y(v_1),\ldots ,Y(v_r)\) is equal to \(a Y( x_{<} ) + bY( x_{>} )\), where \(x_{<}\) and \(x_{>}\) are the leftmost and rightmost neighbors of x in \(\{ v_1 , \ldots , v_r \}\) and where a, b are nonzero real numbers (Bachoc et al. 2017). Finally, because the covariance matrix of \(Y(v_1),\ldots ,Y(v_r)\) is invertible, two linear combinations \(\sum _{i=1}^r a_i Y(v_i)\) and \(\sum _{i=1}^r b_i Y(v_i)\) are almost surely equal if and only if \((a_1,\ldots ,a_r) = (b_1,\ldots ,b_r)\).
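This Markov-type property of the exponential covariance can be observed numerically. In the sketch below (ours; the value of \(\theta \) and the observation points are arbitrary choices), the simple kriging weights of all observations other than the two neighbors of x vanish up to rounding.

```python
import numpy as np

theta = 0.7                                      # arbitrary range parameter
def k_exp(a, b):
    return np.exp(-np.abs(a - b) / theta)        # exponential covariance exp(-|s-t|/theta)

v = np.array([0.1, 0.35, 0.6, 0.9])              # observation points v_1,...,v_r
x = 0.5                                          # x lies strictly between 0.35 and 0.6
K = k_exp(v[:, None], v[None, :])
weights = np.linalg.solve(K, k_exp(v, x))        # simple kriging weights of Y(v_1),...,Y(v_r)
print(np.round(weights, 10))
# only the weights attached to the two neighbors of x (0.35 and 0.6) are nonzero,
# illustrating that the conditional expectation depends on the nearest neighbors only
```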
Assume that \(X_1,\ldots ,X_p\) is a perfect clustering and let \(x \in \mathbb {R}\). It is known from Rullière et al. (2018) that \(M_{\mathcal {A}}(x) = M_{\mathrm{full}}(x)\) almost surely if \(x \in \{x_1,\ldots ,x_n\}\). Consider now that \(x \not \in \{x_1,\ldots ,x_n\}\).
If \(x < \min (x_1,\ldots ,x_n)\), then for \(i=1,\ldots ,p\), \(M_i(x) = \exp ( - | x_{j_i} - x | / \theta ) Y(x_{j_i})\) with \(x_{j_i} = \min \{ x ; x \in X_i \}\). Let \(i^* \in \{1,\ldots ,p\}\) be so that \(\min (x_1,\ldots ,x_n) \in X_{i^*}\). Then \(M_{\mathrm{full}}(x) = \exp ( - | x_{j_{i^*}} - x | / \theta ) Y(x_{j_{i^*}})\). As a consequence, the linear combination \(\lambda _x^t M(x) \) minimizing \({{\,\mathrm{\mathrm {E}}\,}}[ (\lambda ^t M(x) - Y(x) )^2]\) over \(\lambda \in \mathbb {R}^p\) is given by \(\lambda _x = e_{i^*}\) with \(e_{i^*}\) the \(i^*\)-th base column vector of \(\mathbb {R}^p\). This implies that \(M_{\mathrm{full}}(x) = M_{{\mathcal {A}}}(x)\) almost surely. Similarly, if \(x > \max (x_1,\ldots ,x_n)\), then \(M_{\mathrm{full}}(x) = M_{{\mathcal {A}}}(x)\) almost surely.
Consider now that there exists \(u \in X_{i}\) and \(v \in X_{j}\) so that \( u< x < v \) and (u, v) does not intersect with \(\{x_1,\ldots ,x_n\}\). If \(i = j\), then \(M_i(x) = M_{{\mathrm{full}}}(x)\) almost surely because the leftmost and rightmost neighbors of x are both in \(X_i\). Hence, also \(M_{{\mathcal {A}}}(x) =M_{\mathrm{full}}(x)\) almost surely in this case. If \(i \ne j\), then \(u = \max \{t ; t \in X_i\}\) and \(v = \min \{t ; t \in X_j\}\) because \(X_1,\ldots ,X_p\) is a perfect clustering. Hence, \(M_i(x) = \exp ( - |x - u| / \theta ) Y(u)\), \(M_j(x) = \exp ( - |x - v| / \theta ) Y(v)\) and \(M_{{\mathrm{full}}}(x) = a Y(u) + bY(v)\) with \(a,b \in \mathbb {R}\). Hence, there exists a linear combination \(\lambda _i M_i(x) + \lambda _j M_j(x)\) that equals \(M_{{\mathrm{full}}}(x)\) almost surely. As a consequence, the linear combination \(\lambda _x^t M(x) \) minimizing \({{\,\mathrm{\mathrm {E}}\,}}[ (\lambda ^t M(x) - Y(x) )^2]\) over \(\lambda \in \mathbb {R}^p\) is given by \(\lambda _x = \lambda _i e_{i} + \lambda _j e_j\), with \(e_i\) and \(e_j\) the i-th and j-th base column vectors of \(\mathbb {R}^p\). Hence \(M_{\mathrm{full}}(x) = M_{{\mathcal {A}}}(x)\) almost surely. All the possible sub-cases have now been treated, which proves the first implication of the proposition.
Assume now that \(X_1,\ldots ,X_p\) is not a perfect clustering. Then there exists a triplet u, v, w, with \(u,v \in X_i\) and \(w \in X_j\) with \(i,j=1 \ldots ,p\), \(i \ne j\), and so that \( u< w <v \). Without loss of generality it can further be assumed that there does not exist \(z \in X_i\) satisfying \(u< z <v\).
Let x satisfy \( u< x < w \) and be such that (u, x) does not intersect \(\{ x_1,\ldots ,x_n \}\). Then \(M_{{\mathrm{full}}} (x) = a Y(u) + b Y(z)\) with \(a , b \in \mathbb {R}\backslash \{0\}\) and \(z \in \{ x_1,\ldots ,x_n \}\), \(z \ne v\). Also, \(M_i(x) = \alpha Y(u) + \beta Y(v)\) with \(\alpha , \beta \in \mathbb {R}\backslash \{0\}\). As a consequence, there cannot exist a linear combination \(\lambda ^t M(x)\) with \(\lambda \in \mathbb {R}^p\) so that \(\lambda ^t M(x) = a Y(u) + b Y(z)\). Indeed, a linear combination \(\lambda ^t M(x)\) is a linear combination of \(Y(x_1),\ldots ,Y(x_n)\) where the coefficients for Y(u) and Y(v) are \(\lambda _i \alpha \) and \(\lambda _i \beta \), which are either simultaneously zero or simultaneously nonzero. Hence, \(M_{\mathcal {A}}(x)\) is almost surely not equal to \(M_{\mathrm{full}}(x)\). This concludes the proof. \(\square \)
Proof of Proposition 7
Let \(x \in \mathbb {R}\setminus X\). For \(i=1, \ldots , p\), \(0< v_i(x) < v_\mathrm{prior}(x)\), so that \(\alpha _i(v_1(x),\ldots ,v_p(x),v_\mathrm{prior}(x)) \in \mathbb {R}\setminus {\lbrace {0}\rbrace }\). Hence, the linear combination \(M_{\mathcal {A}}(x) = \sum _{k=1}^{p} \alpha _{k}(v_1(x),\ldots ,v_{p}(x),v_\mathrm{prior}(x)) M_k(x)\) is a linear combination of \(Y(x_1),\ldots ,Y(x_n)\) with at least p nonzero coefficients (since each \(M_k(x)\) is a linear combination of one or two elements of \(Y(x_1),\ldots ,Y(x_n)\), all these elements being pairwise distinct, see the beginning of the proof of Proposition 6). Hence, because the covariance matrix of \(Y(x_1),\ldots ,Y(x_n)\) is invertible, \(M_{\mathcal {A}}(x)\) almost surely cannot be equal to \(M_{{\mathrm{full}}}(x)\), since \(M_{{\mathrm{full}}}(x)\) is a linear combination of \(Y(x_1),\ldots ,Y(x_n)\) with one or two nonzero coefficients. \(\square \)
Appendix 6. Proofs from Sect. 5
Proof of Proposition 8
Because \(M_{{\mathcal {A}},\eta }(x)\) is the best linear predictor of Y(x), for \(n \in \mathbb {N}\), we have
Let \(\epsilon >0\). Let \(N_n\) be the number of points in \(X_{i_n}\) that are at Euclidean distance less than \(\epsilon \) from x. By assumption, \(N_n \rightarrow \infty \) as \(n \rightarrow \infty \). Let us write these points as \(x_{nj_1},\ldots ,x_{nj_{N_n}}\), with corresponding measurement errors \(\xi _{j_1},\ldots ,\xi _{j_{N_n}}\). Since \(M_{\eta ,i_n}(x)\) is the best linear unbiased predictor of Y(x) from the elements of \(Y(x_{n j_1}) + \xi _{j_1} , \ldots ,Y(x_{n j_{N_n}}) + \xi _{j_{N_n}} \), we have
By independence of Y and \(\xi _X\), we obtain
The above inequality follows from the Cauchy–Schwarz inequality, the fact that Y has mean zero, and the independence of \(\xi _{j_1},\ldots ,\xi _{j_{N_n}}\). We then obtain, since \((\eta _a)_{a \in \mathbb {N}}\) is bounded,
From Eqs. (41) and (42), we have, for any \(\epsilon >0\)
The above display tends to zero as \(\epsilon \rightarrow 0\) because k is continuous. Hence the \(\limsup \) in Eq. (43) is zero, which concludes the proof. \(\square \)
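The mechanism of this bound can be illustrated numerically. The sketch below (ours; the squared-exponential covariance, the noise variances and the point locations are arbitrary choices) evaluates the exact mean square error of the plain average of N noisy observations located within 0.01 of x; this average is one particular linear combination, so its error upper-bounds that of the best linear predictor \(M_{\eta ,i_n}(x)\), and it decreases towards zero as N grows.

```python
import numpy as np

def k_se(a, b, ell=0.2):
    return np.exp(-0.5 * (a - b) ** 2 / ell ** 2)    # squared-exponential covariance (arbitrary choice)

def averaging_mse(x, pts, eta):
    """Exact mean square error of the plain average of the noisy observations
    Y(x_j) + xi_j, j = 1,...,N (Y centred, xi independent with variances eta)."""
    N = len(pts)
    return (k_se(x, x)
            - 2.0 * k_se(pts, x).mean()                   # -(2/N) sum_j k(x, x_j)
            + k_se(pts[:, None], pts[None, :]).mean()     # (1/N^2) sum_{j,l} k(x_j, x_l)
            + eta.sum() / N ** 2)                         # (1/N^2) sum_j eta_j

x = 0.5
for N in (10, 100, 1000):
    pts = x + 0.01 * np.linspace(-1.0, 1.0, N)            # N points within 0.01 of x
    print(N, averaging_mse(x, pts, np.full(N, 0.3)))      # decreases towards 0 as N grows
```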
Proof of Lemma 1
Let \(\epsilon >0\). For \(n \in \mathbb {N}\), let \(N_n\) be the number of points in \(\{x_1,\ldots ,x_n \}\) that are at Euclidean distance less than \(\epsilon \) from x. Because x is in the interior of D and because \(g >0\) on D, we have \(p_{\epsilon } = {{\,\mathrm{\mathrm {P}}\,}}( ||x_1 - x || \le \epsilon ) >0\). Hence, from the law of large numbers, almost surely, for large enough n, \(N_n \ge (p_{\epsilon }/2) n \). For each \(n \in \mathbb {N}\), the \(N_n\) points in \(\{x_1,\ldots ,x_n \}\) that are at Euclidean distance less than \(\epsilon \) from x are partitioned into \(p_n\) classes. Hence, one of these classes, say the class \(X_{i_n}\), contains a number of points larger than or equal to \( N_n / p_n \). Since \(n / p_n\) tends to infinity by assumption, we conclude that the number of points in \(X_{i_n}\) at distance less than \(\epsilon \) from x almost surely tends to infinity. This concludes the proof. \(\square \)
Proof of Proposition 9
The proof is based on the same construction of the triangular array of observation points and of the sequence of partitions as in the proof of Proposition 1. We take x as \(x_0\) in this proof. Only a few comments are needed.
We let \(V(\delta )\) be as in the proof of Proposition 1 and we note that for any \(\delta >0\), for any \(r \in \mathbb {N}\), for any Gaussian vector \((U_1,\ldots ,U_{r})\) independent of Y and for any \(u_0,u_1,\ldots ,u_r \in D\) with \(||u_i - u_0|| \ge \delta \) for \(i=1,\ldots ,r\), we have
We also note that the triangular array and sequence of partitions of the proof of Proposition 1 do satisfy the condition of Proposition 8. Indeed, the first component \(X_1\) of the partition, with cardinality \(C_n \rightarrow \infty \), is dense in D.
We note that for \(k=k_n+1,\ldots ,p_n\) (notations of the proof of Proposition 1), for any row of \(X_k\), of the form \(x_{nb}\) with \(b \in \{ 1 , \ldots ,n\}\), we have \(v_k(x) \le V[ Y(x) | Y(x_{nb}) + \xi _{b} ] \le k(x,x) - k(x,x_{nb})^2 / (k(x_{nb},x_{nb}) + \eta _{b})\). Hence, because \((\eta _i)_{i \in \mathbb {N}}\) is bounded, there is a fixed \(\epsilon '_2 >0\) such that for \(k=k_n+1,\ldots ,p_n\), \(\epsilon _1 \le v_k(x) \le k(x,x) - \epsilon '_2\), with \(\epsilon _1\) as in the proof of Proposition 1.
With these comments, the arguments of the proof of Proposition 1 lead to the conclusion of Proposition 9. \(\square \)
Proof of Proposition 10
We can see that \(M_{\text {UK},i}(x) = w_i(x)^t Z(X_i) \) for \(i=1,\ldots ,p\). Hence, for \(i,j=1,\ldots ,p\)
Hence, \({{\,\mathrm{\mathrm {Cov}}\,}}\left[ M_{\text {UK}}(x) \right] = K_{\text {UK},M}(x,x) \). Furthermore, for \(i = 1 , \ldots ,p\)
Hence, \({{\,\mathrm{\mathrm {Cov}}\,}}\left[ M_{\text {UK}}(x) , Z(x) \right] = k_{\text {UK},M}(x,x) \). Let
Since \({{\,\mathrm{\mathrm {E}}\,}}[ M_{\text {UK},i}(x) ] = {{\,\mathrm{\mathrm {E}}\,}}[ Z(x) ]\) for \(i=1,\ldots ,p\) and for any value of \(\beta \in \mathbb {R}^m\), the constraint in Eq. (44) can be written as \(\gamma ^t \mathrm {1}_p{{\,\mathrm{\mathrm {E}}\,}}[ Z(x)] = {{\,\mathrm{\mathrm {E}}\,}}[ Z(x) ] \); that is, \(\gamma ^t \mathrm {1}_p= 1\). The mean square prediction error in Eq. (44) can be written as
Thus Eq. (44) becomes
We recognize the optimization problem of ordinary kriging, which corresponds to universal kriging with an unknown constant mean function (Sacks et al. 1989; Chilès and Delfiner 2012). Hence, we have
from, for instance, Sacks et al. (1989) and Chilès and Delfiner (2012). Hence we have \(\alpha (x)^t M_{\text {UK}}(x) = M_{{\mathcal {A}},\text {UK}}(x)\), the best linear predictor described in Proposition 10.
We can see that \(M_{{\mathcal {A}},\text {UK}}(x) = \alpha _{{\mathcal {A}},\text {UK}}(x)^t M_{\text {UK}}(x) \) and that \(\alpha _{{\mathcal {A}},\text {UK}}(x) = \alpha (x)\). Then, since \({{\,\mathrm{\mathrm {E}}\,}}[ \alpha _{{\mathcal {A}},\text {UK}}(x)^t M_{\text {UK}}(x) ] = {{\,\mathrm{\mathrm {E}}\,}}[ Z(x) ]\), from \({{\,\mathrm{\mathrm {Cov}}\,}}\left[ M_{\text {UK}}(x) \right] = K_{\text {UK},M}(x,x) \) and from \({{\,\mathrm{\mathrm {Cov}}\,}}\left[ M_{\text {UK}}(x) , Z(x) \right] = k_{\text {UK},M}(x,x) \), we obtain
This concludes the proof. \(\square \)
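The constrained minimization appearing at the end of this proof admits the classical closed-form solution of ordinary kriging, which fits in a few lines. The sketch below (ours; the function name and interface are illustrative) returns the weights \(\gamma \) minimizing a quadratic criterion of the form \(\gamma ^t K \gamma - 2 \gamma ^t k\) under the constraint \(\gamma ^t \mathrm {1}_p= 1\), via the standard Lagrange multiplier argument.

```python
import numpy as np

def ordinary_kriging_weights(K_M, k_M):
    """Classical Lagrange multiplier solution (ours, for illustration) of
    min_gamma  gamma^t K_M gamma - 2 gamma^t k_M   subject to   gamma^t 1 = 1."""
    ones = np.ones(len(k_M))
    Kinv_k = np.linalg.solve(K_M, k_M)
    Kinv_1 = np.linalg.solve(K_M, ones)
    mu = (1.0 - ones @ Kinv_k) / (ones @ Kinv_1)   # multiplier enforcing the constraint
    return Kinv_k + mu * Kinv_1                    # weights sum to 1 up to rounding errors
```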