## Abstract

This work addresses the problem of predicting the value of a real-valued function at some input locations given a limited number of observations of this function. The Kriging interpolation technique (or Gaussian process regression) is often considered for such a problem, but the method suffers from its computational burden when the number of observation points is large. We introduce in this article nested Kriging predictors, which are constructed by aggregating sub-models based on subsets of the observation points. This approach is proven to have better theoretical properties than other aggregation methods found in the literature; in particular, contrary to some of them, the proposed aggregation method can be shown to be consistent. Finally, the practical interest of the proposed method is illustrated on simulated datasets and on an industrial test case with \(10^4\) observations in a 6-dimensional space.



## Acknowledgements

Part of this research was conducted within the frame of the Chair in Applied Mathematics OQUAIDO, gathering partners in technological research (BRGM, CEA, IFPEN, IRSN, Safran, Storengy) and academia (Ecole Centrale de Lyon, Mines Saint-Etienne, University of Grenoble, University of Nice, University of Toulouse and CNRS) around advanced methods for Computer Experiments. The authors would like to warmly thank Dr. Géraud Blatman and EDF R&D for providing us with the industrial test case. They also thank the editor and reviewers for very precise and constructive comments on this paper. This paper was finished during a stay of D. Rullière at the Vietnam Institute for Advanced Study in Mathematics; the latter author thanks the VIASM institute and the DAMI research chair (Data Analytics & Models for Insurance) for their support.


## Appendix: Proof of Proposition 4


*Complexities* Under the chosen assumptions on the \(\alpha \) and \(\beta \) coefficients, for a regular tree and in the case of simple Kriging sub-models, \(\mathcal {C}_\alpha =\sum _{\nu =1}^{\bar{\nu }} \sum _{i=1}^{n_\nu } \alpha c_\nu ^3 =\alpha \sum _{\nu =1}^{\bar{\nu }} c_\nu ^3 n_\nu \) and \(\mathcal {C}_\beta =\sum _{\nu =1}^{\bar{\nu }} \sum _{i=2}^{n_\nu } \sum _{j=1}^{i-1}\beta c^2_{\nu } =\frac{\beta }{2} \sum _{\nu =1}^{\bar{\nu }} n_\nu (n_{\nu }-1) c^2_{\nu }\). Notice that the sum starts from \(\nu =1\) in order to include the sub-models' calculation.

*Equilibrated tree complexities* In a constant child-number setting, where \(c_\nu =c\) for all \(\nu \), the tree structure ensures that \(n_{\nu }=n/c^{\nu }\); thus, as \(c=n^{1/\bar{\nu }}\), we get \(\mathcal {C}_\alpha \sim \alpha n^{1+\frac{2}{\bar{\nu }}}\) and \(\mathcal {C}_\beta \sim \frac{\beta }{2} n^2\) when \(n \rightarrow +\infty \). The result for the equilibrated two-layer tree, where \(\bar{\nu }=2\), derives directly from this one; in this case \(\mathcal {C}_\alpha \sim \alpha n^{2}\) and \(\mathcal {C}_\beta \sim \frac{\beta }{2} n^2\) (it also derives from the expressions of \(\mathcal {C}_\alpha \) and \(\mathcal {C}_\beta \) when \(c_1=c_2=\sqrt{n}\), \(n_1=\sqrt{n}\), \(n_2=1\)).

*Optimal tree complexities* One easily shows that, under the chosen assumptions, \(\mathcal {C}_\beta \sim \frac{\beta }{2}n^2\). It is thus indeed not possible to reduce the whole complexity to orders lower than \(O(n^2)\). However, one can choose the tree structure so as to reduce the complexity \(\mathcal {C}_\alpha \). For a regular tree, \(n_\nu =n/(c_1 \cdots c_{\nu })\), so that \(\frac{\partial }{\partial c_k} n_{\nu } = -\mathbf {1}_{{\lbrace {\nu \ge k}\rbrace }} n_\nu /c_k\). Using a Lagrange multiplier \(\ell \), one defines \(\xi (k)=c_k \frac{\partial }{\partial c_k} \left( \mathcal {C}_\alpha - \ell (c_1 \cdots c_{\bar{\nu }} -n) \right) = 3\alpha c_k^3 n_k - \alpha \sum _{\nu =k}^{\bar{\nu }}c_\nu ^3 n_\nu - \ell c_1 \cdots c_{\bar{\nu }}\). The tree structure that minimizes \(\mathcal {C}_\alpha \) is such that \(\xi (k)=\xi (k+1)=0\) for all \(k<\bar{\nu }\). Using \(c_{k+1} n_{k+1}=n_k\), one gets \(3c_{k+1}^2=2 c_{k}^3\) for all \(k<\bar{\nu }\), and setting \(c_1\cdots c_{\bar{\nu }}=n\) yields \(c_\nu = \delta \left( \delta ^{-\bar{\nu }} n\right) ^{\frac{\delta ^{\nu -1}}{2(\delta ^{\bar{\nu }}-1)}}\), \(\nu =1, \ldots , \bar{\nu }\), with \(\delta =\frac{3}{2}\). Setting \(\gamma =\frac{27}{4}\delta ^{-\frac{\bar{\nu }}{\delta ^{\bar{\nu }}-1}}\left( 1-\delta ^{-\bar{\nu }}\right) \), direct calculations show that this tree structure corresponds to the complexities \(\mathcal {C}_\alpha = \gamma \alpha n^{1+\frac{1}{\delta ^{\bar{\nu }}-1}}\) and \(\mathcal {C}_\beta \sim \frac{\beta }{2}n^2\). In a two-layer setting one gets \(c_1=\left( \frac{3}{2}\right) ^{1/5} n^{2/5}\) and \(c_2=\left( \frac{3}{2}\right) ^{-1/5} n^{3/5}\), which leads to \(\mathcal {C}_\alpha = \gamma \alpha n^{9/5}\) and \(\mathcal {C}_\beta = \frac{\beta }{2} n^2 - \frac{\beta }{2} \left( \frac{3}{2}\right) ^{\frac{1}{5}}n^{\frac{7}{5}}\), where \(\gamma =(\frac{2}{3})^{-2/5}+(\frac{2}{3})^{3/5}\simeq 1.96\). Notice that, even for values of *n* of order \(10^5\), terms of order \(n^{9/5}\) are not necessarily negligible compared to those of order \(n^2\); \(\mathcal {C}_\beta \) is only slightly affected by the choice of the tree structure, but the global complexity benefits from the optimization of \(\mathcal {C}_\alpha \).
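The closed-form optimal child numbers and the resulting complexity can be checked numerically. The sketch below (function names are ours, not from the paper) builds the \(c_\nu \) from the formula above, verifies that they multiply back to \(n\), and compares \(\mathcal {C}_\alpha \) against the constant \(\gamma \):

```python
import math

def optimal_children(n, nbar, delta=1.5):
    """Child numbers c_1..c_nbar minimizing C_alpha for a regular tree:
    c_nu = delta * (delta**-nbar * n) ** (delta**(nu-1) / (2 * (delta**nbar - 1)))."""
    return [delta * (n / delta**nbar) ** (delta**(nu - 1) / (2 * (delta**nbar - 1)))
            for nu in range(1, nbar + 1)]

def complexity_alpha(n, children, alpha=1.0):
    """C_alpha = alpha * sum_nu c_nu^3 * n_nu, with n_nu = n / (c_1 ... c_nu)."""
    total, prod = 0.0, 1.0
    for c in children:
        prod *= c
        total += alpha * c**3 * (n / prod)
    return total

n, nbar, delta = 10**4, 2, 1.5
c = optimal_children(n, nbar)
# the optimal child numbers multiply back to n (product constraint)
assert abs(c[0] * c[1] - n) < 1e-6 * n
# closed form: C_alpha = gamma * alpha * n^(1 + 1/(delta^nbar - 1)), gamma ~ 1.96 for nbar = 2
gamma = (27 / 4) * delta ** (-nbar / (delta**nbar - 1)) * (1 - delta**-nbar)
closed_form = gamma * n ** (1 + 1 / (delta**nbar - 1))
assert abs(complexity_alpha(n, c) - closed_form) < 1e-9 * closed_form
```

For \(\bar{\nu }=2\) the first line of `optimal_children` reduces to \(c_1=(3/2)^{1/5} n^{2/5}\) and \(c_2=(3/2)^{-1/5} n^{3/5}\), matching the two-layer expressions above.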

*Storage footprint* First, covariances can be stored in triangular matrices, so the temporary objects *M*, *k* and *K* in Algorithm 1 require the storage of \(c_{\max }(c_{\max }+5)/2\) real values. For a given step \(\nu \), \(\nu \ge 2\), building all vectors \(\alpha _i\) requires the storage of \(\sum _{i=1}^{n_{\nu }} c_i^{\nu }=n_{\nu -1}\) values. Finally, for a given step \(\nu \), we simultaneously need the objects \(M_{\nu -1}, K_{\nu -1}, M_{\nu }, K_{\nu }\), which require the storage of \(n_{\nu -1}(n_{\nu -1}+3)/2 + n_{\nu }(n_{\nu }+3)/2\) real values. In a regular tree, as \(n_\nu \) is decreasing in \(\nu \), the storage footprint is \(\mathcal {S} = (c_{\max }(c_{\max }+5) + n_1(n_1+5) + n_2(n_2+3))/2\). Hence the equivalents of \(\mathcal {S}\) for the different tree structures: \(\mathcal {S}\sim n\) for the two-layer equilibrated tree, \(\mathcal {S}\sim \frac{1}{2}n^{2-2/\bar{\nu }}\) for the \(\bar{\nu }\)-layer tree with \(\bar{\nu }>2\), and the indicated result for the optimal tree. Simpler orders are given in the proposition, which avoids separating the case \(\bar{\nu }=2\) and a cumbersome constant for the optimal tree.
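The footprint formula can be sanity-checked for the two-layer equilibrated tree, where \(c_{\max }=c_1=c_2=\sqrt{n}\), \(n_1=\sqrt{n}\) and \(n_2=1\). A minimal sketch under our own naming:

```python
import math

def storage_footprint(c_max, n1, n2):
    """S = (c_max(c_max+5) + n1(n1+5) + n2(n2+3)) / 2 real values,
    the regular-tree footprint derived above."""
    return (c_max * (c_max + 5) + n1 * (n1 + 5) + n2 * (n2 + 3)) / 2

n = 10**6
c = math.sqrt(n)  # two-layer equilibrated tree: c_1 = c_2 = sqrt(n)
S = storage_footprint(c, math.sqrt(n), 1)
# S = n + 5*sqrt(n) + 2, so S ~ n as n grows
print(S / n)  # prints 1.005002
```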


## About this article

### Cite this article

Rullière, D., Durrande, N., Bachoc, F. *et al.* Nested Kriging predictions for datasets with a large number of observations. *Stat Comput* **28**, 849–867 (2018). https://doi.org/10.1007/s11222-017-9766-2


### Keywords

- Gaussian process regression
- Big data
- Aggregation methods
- Best linear unbiased predictor
- Spatial processes