Nested Kriging predictions for datasets with a large number of observations

Abstract

This work falls within the context of predicting the value of a real function at some input locations given a limited number of observations of this function. The Kriging interpolation technique (or Gaussian process regression) is often considered to tackle such a problem, but the method suffers from its computational burden when the number of observation points is large. We introduce in this article nested Kriging predictors which are constructed by aggregating sub-models based on subsets of observation points. This approach is proven to have better theoretical properties than other aggregation methods that can be found in the literature. In particular, contrary to some other methods, the proposed aggregation can be shown to be consistent. Finally, the practical interest of the proposed method is illustrated on simulated datasets and on an industrial test case with \(10^4\) observations in a 6-dimensional space.
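The aggregation idea can be pictured with a small numerical sketch. The code below is only illustrative and is not the reference implementation accompanying the paper: it assumes a two-layer structure, simple Kriging sub-models, and a squared-exponential covariance with arbitrary hyperparameters; all function names are ours.

```python
import numpy as np

def cov(A, B, lengthscale=0.2, variance=1.0):
    """Squared-exponential covariance between the rows of A and B (illustrative choice)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def nested_kriging_predict(X_groups, y_groups, x):
    """Aggregate simple Kriging sub-models built on subsets (X_i, y_i) at a single point x."""
    x = np.atleast_2d(x)
    p = len(X_groups)
    M = np.empty(p)                     # sub-model means M_i(x)
    k = np.empty(p)                     # k_i = Cov(M_i(x), Y(x))
    K = np.empty((p, p))                # K_ij = Cov(M_i(x), M_j(x))
    alphas = []                         # simple Kriging weights of each sub-model
    for i, (Xi, yi) in enumerate(zip(X_groups, y_groups)):
        Ki = cov(Xi, Xi) + 1e-10 * np.eye(len(Xi))   # jitter for numerical stability
        ki = cov(Xi, x)[:, 0]
        ai = np.linalg.solve(Ki, ki)
        alphas.append(ai)
        M[i] = ai @ yi
        k[i] = ai @ ki
    for i in range(p):
        for j in range(p):
            K[i, j] = alphas[i] @ cov(X_groups[i], X_groups[j]) @ alphas[j]
    K += 1e-10 * np.eye(p)
    w = np.linalg.solve(K, k)           # second-layer (aggregation) weights
    mean = w @ M
    var = cov(x, x)[0, 0] - w @ k       # aggregated prediction variance
    return mean, var

# Toy usage: 200 points in dimension 2, split into 10 sub-models of 20 points each.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.sin(3 * X[:, 0]) + np.cos(2 * X[:, 1])
groups = np.array_split(np.arange(200), 10)
print(nested_kriging_predict([X[g] for g in groups], [y[g] for g in groups], X[0]))
```

Here each subset yields a sub-model mean, and these means are in turn combined by a second Kriging layer based on their cross-covariances; the full method applies this construction recursively over a tree of subsets.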


Acknowledgements

Part of this research was conducted within the frame of the Chair in Applied Mathematics OQUAIDO, gathering partners in technological research (BRGM, CEA, IFPEN, IRSN, Safran, Storengy) and academia (Ecole Centrale de Lyon, Mines Saint-Etienne, University of Grenoble, University of Nice, University of Toulouse and CNRS) around advanced methods for Computer Experiments. The authors would like to warmly thank Dr. Géraud Blatman and EDF R&D for providing the industrial test case. They also thank the editor and the reviewers for their precise and constructive comments on this paper. This paper was finished during a stay of D. Rullière at the Vietnam Institute for Advanced Study in Mathematics (VIASM); he thanks the VIASM institute and the DAMI research chair (Data Analytics & Models for Insurance) for their support.

Author information


Corresponding author

Correspondence to Nicolas Durrande.

Appendix: Proof of Proposition 4

Complexities. Under the chosen assumptions on the \(\alpha \) and \(\beta \) coefficients, for a regular tree and in the case of simple Kriging sub-models, \(\mathcal {C}_\alpha =\sum _{\nu =1}^{\bar{\nu }} \sum _{i=1}^{n_\nu } \alpha c_\nu ^3 =\alpha \sum _{\nu =1}^{\bar{\nu }} c_\nu ^3 n_\nu \) and \(\mathcal {C}_\beta =\sum _{\nu =1}^{\bar{\nu }} \sum _{i=2}^{n_\nu } \sum _{j=1}^{i-1}\beta c^2_{\nu } =\frac{\beta }{2} \sum _{\nu =1}^{\bar{\nu }} n_\nu (n_{\nu }-1) c^2_{\nu }\). Notice that the sums start from \(\nu =1\) in order to include the computation of the sub-models.

Equilibrated tree complexities. In the constant child number setting, where \(c_\nu =c\) for all \(\nu \), the tree structure ensures that \(n_{\nu }=n/c^{\nu }\). Thus, as \(c=n^{1/\bar{\nu }}\), we get when \(n \rightarrow +\infty \), \(\mathcal {C}_\alpha \sim \alpha n^{1+\frac{2}{\bar{\nu }}}\) and \(\mathcal {C}_\beta \sim \frac{\beta }{2} n^2\). The result for the equilibrated two-layer tree, where \(\bar{\nu }=2\), derives directly from this one: in this case \(\mathcal {C}_\alpha \sim \alpha n^{2}\) and \(\mathcal {C}_\beta \sim \frac{\beta }{2} n^2\) (it also follows from the expressions of \(\mathcal {C}_\alpha \) and \(\mathcal {C}_\beta \) when \(c_1=c_2=\sqrt{n}\), \(n_1=\sqrt{n}\), \(n_2=1\)).

Optimal tree complexities. One easily shows that, under the chosen assumptions, \(\mathcal {C}_\beta \sim \frac{\beta }{2}n^2\). Thus, it is indeed not possible to reduce the whole complexity to orders lower than \(O(n^2)\). However, one can choose the tree structure in order to reduce the complexity \(\mathcal {C}_\alpha \). For a regular tree, \(n_\nu =n/(c_1 \cdots c_{\nu })\), so that \(\frac{\partial }{\partial c_k} n_{\nu } = -\mathbf {1}_{{\lbrace {\nu \ge k}\rbrace }} n_\nu /c_k\). Using a Lagrange multiplier \(\ell \), one defines \(\xi (k)=c_k \frac{\partial }{\partial c_k} \left( \mathcal {C}_\alpha - \ell (c_1 \cdots c_{\bar{\nu }} -n) \right) = 3\alpha c_k^3 n_k - \alpha \sum _{\nu =k}^{\bar{\nu }}c_\nu ^3 n_\nu - \ell c_1 \cdots c_{\bar{\nu }}\). The tree structure that minimizes \(\mathcal {C}_\alpha \) is such that \(\xi (k)=\xi (k+1)=0\) for all \(k<\bar{\nu }\). Using \(c_{k+1} n_{k+1}=n_k\), one gets \(3c_{k+1}^2=2 c_{k}^3\) for all \(k<\bar{\nu }\), and, imposing \(c_1\cdots c_{\bar{\nu }}=n\), \(c_\nu = \delta \left( \delta ^{-\bar{\nu }} n\right) ^{\frac{\delta ^{\nu -1}}{2(\delta ^{\bar{\nu }}-1)}}\) for \(\nu =1, \ldots , \bar{\nu }\), with \(\delta =\frac{3}{2}\). Setting \(\gamma =\frac{27}{4}\delta ^{-\frac{\bar{\nu }}{\delta ^{\bar{\nu }}-1}}\left( 1-\delta ^{-\bar{\nu }}\right) \), direct calculations show that this tree structure corresponds to the complexities \(\mathcal {C}_\alpha = \gamma \alpha n^{1+\frac{1}{\delta ^{\bar{\nu }}-1}}\) and \(\mathcal {C}_\beta \sim \frac{\beta }{2}n^2\). In the two-layer setting one gets \(c_1=\left( \frac{3}{2}\right) ^{1/5} n^{2/5}\) and \(c_2=\left( \frac{3}{2}\right) ^{-1/5} n^{3/5}\), which leads to \(\mathcal {C}_\alpha = \gamma \alpha n^{9/5}\) and \(\mathcal {C}_\beta = \frac{\beta }{2} n^2 - \frac{\beta }{2} \left( \frac{3}{2}\right) ^{\frac{1}{5}}n^{\frac{7}{5}}\), where \(\gamma =(\frac{2}{3})^{-2/5}+(\frac{2}{3})^{3/5}\simeq 1.96\). Note finally that even for values of n of order \(10^5\), terms of order \(n^{9/5}\) are not necessarily negligible compared to those of order \(n^2\), and that \(\mathcal {C}_\beta \) is only slightly affected by the choice of the tree structure, but the global complexity benefits from the optimization of \(\mathcal {C}_\alpha \).
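These scalings can be checked numerically. The snippet below is a small sketch under the assumptions above: it computes the optimal child numbers \(c_\nu \) and the resulting costs for illustrative unit costs \(\alpha =\beta =1\); the function names are ours, not the paper's.

```python
import numpy as np

def optimal_children(n, nbar, delta=1.5):
    """c_nu = delta * (delta^{-nbar} n)^{delta^{nu-1} / (2 (delta^nbar - 1))}, nu = 1..nbar."""
    nus = np.arange(1, nbar + 1)
    expo = delta ** (nus - 1) / (2.0 * (delta ** nbar - 1.0))
    return delta * (n / delta ** nbar) ** expo

def complexities(c, n, alpha=1.0, beta=1.0):
    """C_alpha = alpha * sum_nu c_nu^3 n_nu and C_beta = beta/2 * sum_nu n_nu (n_nu - 1) c_nu^2."""
    n_nu = n / np.cumprod(c)            # n_nu = n / (c_1 ... c_nu)
    C_alpha = alpha * np.sum(c ** 3 * n_nu)
    C_beta = 0.5 * beta * np.sum(n_nu * (n_nu - 1.0) * c ** 2)
    return C_alpha, C_beta

n, nbar = 10 ** 4, 2
c = optimal_children(n, nbar)           # approx. (3/2)^{1/5} n^{2/5} and (3/2)^{-1/5} n^{3/5}
gamma = (2 / 3) ** (-2 / 5) + (2 / 3) ** (3 / 5)
Ca, Cb = complexities(c, n)
print(c, Ca / (gamma * n ** 1.8), Cb / (0.5 * n ** 2))   # both ratios should be close to 1
```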

Storage footprint. First, covariances can be stored in triangular matrices, so the temporary objects M, k and K in Algorithm 1 require the storage of \(c_{\max }(c_{\max }+5)/2\) real values. For a given step \(\nu \ge 2\), building all the vectors \(\alpha _i\) requires the storage of \(\sum _{i=1}^{n_{\nu }} c_i^{\nu }=n_{\nu -1}\) values. Finally, at a given step \(\nu \), we simultaneously need the objects \(M_{\nu -1}, K_{\nu -1}, M_{\nu }, K_{\nu }\), which require the storage of \(n_{\nu -1}(n_{\nu -1}+3)/2 + n_{\nu }(n_{\nu }+3)/2\) real values. In a regular tree, as \(n_\nu \) is decreasing in \(\nu \), the storage footprint is \(\mathcal {S} = (c_{\max }(c_{\max }+5) + n_1(n_1+5) + n_2(n_2+3))/2\). This gives the equivalents of \(\mathcal {S}\) for the different tree structures: \(\mathcal {S}\sim n\) for the two-layer equilibrated tree, \(\mathcal {S}\sim \frac{1}{2}n^{2-2/\bar{\nu }}\) for the \(\bar{\nu }\)-layer tree with \(\bar{\nu }>2\), and the indicated result for the optimal tree. Simple orders are given in the proposition, which avoids separating the case \(\bar{\nu }=2\) and a cumbersome constant for the optimal tree.
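As a quick sanity check of this count, the short snippet below evaluates \(\mathcal {S}\) for a given vector of child numbers; the function name and the example trees are only illustrative.

```python
import numpy as np

def storage_footprint(c):
    """S = (c_max(c_max + 5) + n_1(n_1 + 5) + n_2(n_2 + 3)) / 2 for child numbers c_1, ..., c_nubar."""
    c = np.asarray(c, dtype=float)
    n = np.prod(c)                       # total number of observations
    n_nu = n / np.cumprod(c)             # n_1, n_2, ...
    n2 = n_nu[1] if len(n_nu) > 1 else 1.0
    return 0.5 * (c.max() * (c.max() + 5) + n_nu[0] * (n_nu[0] + 5) + n2 * (n2 + 3))

print(storage_footprint([100, 100]))        # two-layer equilibrated tree, n = 10^4: S ~ n
print(storage_footprint([10, 10, 10, 10]))  # 4-layer tree, n = 10^4: S ~ n^(3/2) / 2
```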

About this article

Cite this article

Rullière, D., Durrande, N., Bachoc, F. et al. Nested Kriging predictions for datasets with a large number of observations. Stat Comput 28, 849–867 (2018). https://doi.org/10.1007/s11222-017-9766-2

Keywords

  • Gaussian process regression
  • Big data
  • Aggregation methods
  • Best linear unbiased predictor
  • Spatial processes