Asymptotic analysis of the learning curve for Gaussian process regression
Abstract
This paper deals with the learning curve in a Gaussian process regression framework. The learning curve describes the generalization error of the Gaussian process used for the regression. The main result is the proof of a theorem giving the generalization error for a large class of correlation kernels and for any dimension when the number of observations is large. From this theorem, we can deduce the asymptotic behavior of the generalization error when the observation error is small. The presented proof generalizes previous ones that were limited to special kernels or to small dimensions (one or two). The theoretical results are applied to a nuclear safety problem.
Keywords
Gaussian process regression, Asymptotic mean squared error, Learning curves, Generalization error, Convergence rate
1 Introduction
Gaussian process regression is a useful tool to approximate an objective function given some of its observations (Laslett 1994). Originally used in geostatistics to interpolate a random field at unobserved locations (Wackernagel 2003; Berger et al. 2001; Gneiting et al. 2010), it has since been developed in many areas such as environmental and atmospheric sciences.
This method has become very popular over the last decades to build surrogate models from noise-free observations. For example, it is widely used in the field of “computer experiments” to build models which act as surrogates for an expensive computer code (Sacks et al. 1989). Then, through the fast approximation of the computer code, uncertainty quantification and sensitivity analysis can be performed at a low computational cost.
Nonetheless, for many realistic cases, we do not have direct access to the function to be approximated but only to noisy versions of it. For example, if the objective function is the result of an experiment, the available responses can be tainted by measurement noise. In that case, we can reduce the noise of the observations by repeating the experiments at the same locations. Another example is Monte-Carlo based simulators (also called stochastic simulators), which use Monte-Carlo or Monte-Carlo Markov chain methods to solve a system of differential equations through its probabilistic interpretation. For such simulators, the noise level can be tuned by the number of Monte-Carlo particles used in the procedure.
In this paper, we are interested in obtaining learning curves describing the generalization error—defined as the averaged mean squared error—of the Gaussian process regression as a function of the training set size (Rasmussen and Williams 2006). The problem has been addressed in the statistical and numerical analysis areas. For an overview, the reader is referred to (Ritter 2000b) for a numerical analysis point of view and to (Rasmussen and Williams 2006) for a statistical one. In particular, in the numerical analysis literature, the authors are interested in numerical differentiation of functions from noisy data (Ritter 2000a; Bozzini and Rossini 2003). They have found very interesting results for kernels satisfying the Sacks–Ylvisaker conditions of order \(r\) (Sacks and Ylvisaker 1981) but only valid for 1D or 2D functions.
In the statistical literature, Sollich and Halees (2002) give accurate approximations to the learning curve, and Opper and Vivarelli (1999) and Williams and Vivarelli (2000) give upper and lower bounds on it. Their approximations give the asymptotic value of the learning curve (for a very large number of observations). They are based on the Woodbury–Sherman–Morrison matrix inversion lemma (Harville 1997), which holds in finite-dimensional cases; these correspond to degenerate covariance kernels in our context. Nonetheless, classical kernels used in Gaussian process regression are non-degenerate, so we are in an infinite-dimensional case and the Woodbury–Sherman–Morrison formula cannot be used directly.
To deal with asymptotics of Gaussian process learning curves for more general kernels, some authors have used other definitions of the generalization error. For example, Seeger et al. (2008) present consistency results and convergence rates for the cumulative log loss of Bayesian prediction. Their work is revisited by van der Vaart and van Zanten (2011), who suggest and study another risk which is an upper bound for the one presented in (Seeger et al. 2008).
The main result of this paper is the proof of a theorem giving the value of the Gaussian process regression mean squared error (MSE) for a large training set size when the observation noise variance is proportional to the number of observations. This value is given as a function of the eigenvalues and eigenfunctions of the covariance kernel. From this theorem, we can deduce an approximation of the learning curve for non-degenerate and degenerate kernels [which generalizes the proofs given in (Opper and Vivarelli 1999; Sollich and Halees 2002; Picheny 2009)] and for any dimension [which generalizes the proofs given in (Ritter 2000a, b; Bozzini and Rossini 2003)].
The rate of convergence of the best linear unbiased predictor (BLUP) is of practical interest since it provides a powerful tool for decision support. Indeed, from an initial experimental design set, it can predict the additional computational budget (defined as the number of experiments including repetitions) necessary to reach a given desired accuracy.
The paper is organized as follows. First we present the asymptotic framework considered in this paper in Sect. 2. Although the main results of the paper are theoretical contributions, an application is provided in order to emphasize the possible implications for real-world problems. Second, we present in Sect. 3 the main result of the paper, which is the theorem giving the MSE of the considered model for a large training size. This theorem is proved in Sect. 4. Third, we study the rate of convergence of the generalization error when the noise variance decreases in Sect. 5. The theoretical asymptotic rates of convergence are compared with those obtained in numerical simulations on academic examples. Furthermore, a study on how large the training set size should be for the asymptotic formulas to agree with the numerical ones is provided for the specific case of the Brownian motion. Finally, an industrial application to the safety assessment of a nuclear system containing fissile materials is considered in Sect. 6. This real case emphasizes the effectiveness of the theoretical rate of convergence of the BLUP since it predicts a very good approximation of the budget needed to reach a prescribed precision.
2 Generalization error for noisy observations
The general framework of the paper is given in this section. First, the mathematical formalism on which the theoretical developments are based is presented. Then, the considered application is introduced. Finally, the bridge between the theoretical developments and the application is given.
2.1 Asymptotic framework for the analysis of the generalization error
The obtained asymptotic value has already been mentioned in several works (Rasmussen and Williams 2006; Ritter 2000a, b; Bozzini and Rossini 2003; Opper and Vivarelli 1999; Sollich and Halees 2002; Picheny 2009). The original contribution of this paper is a rigorous proof of this result.
2.2 Introduction to stochastic simulators
We present in this section the industrial application studied in Sect. 6.2. A stochastic simulator is a computer code which solves a system of partial differential equations with Monte-Carlo methods. Its particularity is that it provides noisy observations centered on the true solution of the system. Stochastic simulators are widely used in the field of nuclear physics to solve transport equations and model systems containing fissile materials (e.g. nuclear reactors, storages of fissile materials, spacecraft reactors). In this paper, we are interested in a storage of dry Plutonium(IV) oxide (\(PuO_2\)) used as fuel for nuclear reactors or some spacecraft. As \(PuO_2\) is highly toxic, the safety assessment of such storages is of great importance.

\(k_{\text {eff}} > 1\) leads to an uncontrolled chain reaction due to an increasing neutron population.

\(k_{\text {eff}} = 1\) leads to a self-sustained chain reaction with a stable neutron population.

\(k_{\text {eff}} < 1\) leads to a fading chain reaction due to a decreasing neutron population.

\(d_{{\text {PuO}}_2} \in [0.5, 4]\ \text {g.cm}^{-3}\), the density of the fissile powder. It is scaled to \([0,1]\).

\(d_{{\text {water}}} \in [0,1]\ \text {g.cm}^{-3}\), the density of water between storage tubes.
2.3 Relation between the application and the considered mathematical formalism
Let us consider that we want to approximate the function \(x \in \mathbb {R}^d \rightarrow f(x) \in \mathbb {R}\) from noisy observations at points \((x_i)_{i=1,\ldots ,n}\) sampled from the design measure \(\mu \) and with \(s\) replications at each point. We hence have \(ns\) data of the form \(z_{i,j} = f(x_i) + \varepsilon _{i,j}\) and we consider that \((\varepsilon _{i,j})_{\begin{array}{c} i=1,\ldots ,n \\ j =1,\ldots ,s \end{array}}\) are independently distributed from a Gaussian distribution with mean zero and variance \(\sigma _\varepsilon ^2\). Then, denoting the vector of observed values by \(z^{n} = (z_i^n)_{i=1,\ldots ,n} = (\sum _{j=1}^s z_{i,j}/s)_{i=1,\ldots ,n}\), the variance of an observation \(z^{n}_i\) is \( \sigma _\varepsilon ^2/s\). We recognize here the output form given in Sect. 2.2. Thus, if we consider a fixed budget \(T=ns\), we have \( \sigma _\varepsilon ^2/s = n \tau \) with \(\tau = \sigma _\varepsilon ^2/T\) and the observation noise variance is proportional to \(n\) (as presented in Sect. 2.1). It means that if we increase the number \(n\) of observations, we automatically increase the uncertainty on the observations. An observation noise variance proportional to \(n\) is natural in the framework of experiments with repetitions or stochastic simulators. Indeed, for a fixed number of experiments (or simulations), the user can decide to perform them in few points with many repetitions (in that case the noise variance will be low) or to perform them in many points with few repetitions (in that case the noise variance will be large).
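The budget arithmetic of this subsection can be illustrated with a minimal Python sketch. The function names and the numerical values (\(\sigma_\varepsilon^2 = 4\), \(T = 1000\)) are hypothetical and chosen only for illustration; the sketch checks that averaging \(s = T/n\) replications yields a per-point noise variance \(\sigma_\varepsilon^2/s = n\tau\).

```python
import random
import statistics


def reduced_noise(sigma2_eps, budget_T):
    """tau = sigma_eps^2 / T, the noise variance divided by the total budget."""
    return sigma2_eps / budget_T


def averaged_noise_variance(sigma2_eps, n, budget_T):
    """Variance of one averaged observation z_i^n when T = n * s runs are spread over n points."""
    s = budget_T / n  # replications per point (idealized: may be non-integer)
    return sigma2_eps / s


def simulate_averaged_noise(sigma2_eps, s, trials, seed=0):
    """Empirical variance of the mean of s i.i.d. N(0, sigma_eps^2) errors."""
    rng = random.Random(seed)
    sd = sigma2_eps ** 0.5
    means = [sum(rng.gauss(0.0, sd) for _ in range(s)) / s for _ in range(trials)]
    # The errors are centered, so the variance is taken about the known mean 0.
    return statistics.pvariance(means, mu=0.0)


if __name__ == "__main__":
    sigma2_eps, T = 4.0, 1000  # hypothetical values
    tau = reduced_noise(sigma2_eps, T)
    for n in (10, 100):
        print(n, averaged_noise_variance(sigma2_eps, n, T), n * tau)
    print(simulate_averaged_noise(sigma2_eps, s=10, trials=20000))
```

For a fixed budget, moving from \(n = 10\) to \(n = 100\) points multiplies the per-point noise variance by ten, which is the trade-off described above.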
We note that increasing \(n\) with a fixed \(\tau \) is an idealized asymptotic setting since it would require that the number of replications \(s\) could tend to zero while it has to be a positive integer. However, this issue can be tackled in practice since for real applications \(n\) is finite and one just has to take a budget \(T\) such that \(T \ge n\) (i.e. \(s\ge 1\)). This is a first limitation of the suggested method since it cannot be used for a small budget (i.e. when \(T < n\)). A second one is the assumption that \(s\) does not depend on \(x_i\). Indeed, a uniform allocation may not be optimal. In this case, finding the optimal sequence \(\{s_1,s_2,\ldots ,s_n\}\) leading to the minimal error is of practical interest. However, the corresponding observation noise variance will depend on \(x_i\), which means that \(\tau _i\) will depend on \(x_i\) as well. In this case, the presented results do not hold. Nevertheless, they can be used to provide an upper bound for the convergence of the generalization error by considering the worst case \(\tau =\max _i \tau _i\).
The objective of the industrial example presented in Sect. 6.2 is to determine the budget \(T\) required to reach a prescribed accuracy \({\bar{\varepsilon }}\). To deal with this issue, we first build a Gaussian process regression model from an initial budget \(T_0\) and a large number of observations \(n\). Then, from the results on the learning curve, we deduce the budget \(T\) such that the IMSE equals \({\bar{\varepsilon }}\).
3 Convergence of the learning curve for Gaussian process regression
This section deals with the convergence of the BLUP when the number of observations is large. The rate of convergence of the BLUP is evaluated through the generalization error—i.e. the IMSE—defined in (5). The main theorem of this paper follows:
Theorem 1
The proof of Theorem 1 is given in Sect. 4.
Remark
For non-degenerate kernels such that \( \Vert \phi _p(x) \Vert _{L^{\infty }} < \infty \) uniformly in \(p\), the convergence is almost sure. Some kernels such as the one of the Brownian motion satisfy this property.
The following theorem gives the asymptotic value of the learning curve when \(n\) is large.
Theorem 2
Proof
From Theorem 1 and the orthonormal property of the basis \((\phi _p(x))_p\) in \(L^2_\mu (\mathbb {R}^d)\), the proof of the theorem is straightforward by integration. We note that we can permute the integral and the limit thanks to the dominated convergence theorem since \(\sigma ^2(x) \le k(x,x)\). \(\square \)
A strength of Theorem 2 is that it allows for obtaining the rate of convergence of the learning curve even when the eigenvalues \((\lambda _p)_{p \ge 0}\) are not explicit. Indeed, as presented in Sect. 5.2, this rate can be deduced from the asymptotic behavior of \(\lambda _p\) for large \(p\). Furthermore, this asymptotic behavior is known for usual kernels (fractional Brownian kernel, Matérn covariance kernel, Gaussian covariance kernel, ...). However, this is also a limitation since it could be unknown for general covariance kernels.
3.1 Discussion
The limit obtained is identical to the one presented in (Rasmussen and Williams 2006) Sect. 7.3 Eq. (7.26) for a degenerate kernel. Furthermore, the limit in Eq. (8) corresponds to the average bound given for degenerate kernels in (Opper and Vivarelli 1999) in Sect. 6 Eq. (17) with the correspondence \(\tau = \sigma ^2/n\). In particular, they prove that it is a lower bound for the generalization error and an upper bound for the training error. The training error is defined as the empirical mean \(\sum _{i=1}^n \sigma ^2(x_i)/n\) where \((x_i)_{i=1,\ldots ,n }\) are the design points. They also note that this bound should be exact asymptotically for large \(n\) since the sum \(\sum _{i=1}^n \sigma ^2(x_i)/n\) approaches the IMSE. Moreover, they numerically observed that this bound is relevant for a Gaussian covariance kernel (Opper and Vivarelli 1999, Eq. (18)), which is a non-degenerate kernel. The work of Opper and Vivarelli is also investigated in (Williams and Vivarelli 2000; Sollich and Halees 2002; Picheny 2009). In particular, a proof of Theorem 1 is given for degenerate kernels and the relevance of the bound is illustrated on numerical examples using non-degenerate kernels [e.g. Gaussian covariance kernel and exponential kernel (Rasmussen and Williams 2006)].
We note that the proof of Theorem 1 for non-degenerate kernels is of interest since the usual kernels for Gaussian process regression are non-degenerate and we will exhibit dramatic differences between the learning curves of degenerate and non-degenerate kernels.
4 Proof of Theorem 1
We present in this section the proof of Theorem 1. The aim is to find the asymptotic value of the MSE \(\sigma ^2(x)\) (3) when \(n\) tends to infinity. The principle of the proof is to find an upper bound and a lower bound for \(\sigma ^2(x)\) which converge to the same quantity. One of the main ideas of the proof is to use the fact that in a Gaussian process regression framework we consider the BLUP, i.e. the linear unbiased predictor (LUP) which minimizes the MSE. Therefore, for a given Gaussian process modeling the function \(f(x)\), any other LUP has an MSE at least as large, which provides an upper bound. Furthermore, to provide a lower bound for \(\sigma ^2(x)\), we use the result presented in Theorem 1 for degenerate kernels. Therefore, we start the proof by presenting the degenerate case.
4.1 The degenerate case
4.2 The lower bound for \(\sigma ^2(x)\)
The objective is to find a lower bound for the MSE \(\sigma ^2(x)\) (3) for non-degenerate kernels.
4.3 The upper bound for \(\sigma ^2(x)\)
The choice of the LUP (13) is motivated by the fact that the matrix \(A\) is an approximation of the inverse of the matrix \((n \tau I + K)\) that is tractable in the calculations. Indeed, we have \((n \tau I + K) = L+M\) and thus \((n \tau I + K)^{-1} =L^{-1}(I+ L^{-1}M)^{-1}\). Then, the term \((I+ L^{-1}M)^{-1}\) is approximated with the sum \(\sum _{k=1}^q(-1)^k(L^{-1}M)^k\). We note that the condition \(p^*\) such that \(\lambda _{p^*} < \tau \) is used to control the convergence of this sum when \(q\) tends to infinity.
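The truncated series idea can be checked numerically with a small Python sketch. The matrices below are hypothetical stand-ins (a diagonal \(L\) and a small symmetric \(M\)), not the matrices of the proof. The sketch includes the \(k=0\) identity term and multiplies the truncated series by \(L^{-1}\) on the right, which is the ordering that holds without assuming that \(L^{-1}\) and \(M\) commute; it then measures how far \((L+M)A_q\) is from the identity.

```python
def matmul(A, B):
    """Plain-Python product of two small dense matrices (lists of rows)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def truncated_inverse(L_diag, M, q):
    """A_q = [sum_{k=0}^{q} (-1)^k (L^{-1} M)^k] L^{-1}, for a diagonal L."""
    n = len(L_diag)
    X = [[M[i][j] / L_diag[i] for j in range(n)] for i in range(n)]  # L^{-1} M
    term = identity(n)
    S = identity(n)
    for _ in range(q):
        # Next term of the alternating series: (-1)^k X^k
        term = [[-v for v in row] for row in matmul(term, X)]
        S = [[S[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    # Right-multiply by the diagonal L^{-1}
    return [[S[i][j] / L_diag[j] for j in range(n)] for i in range(n)]


def residual(L_diag, M, q):
    """Largest entry of |(L + M) A_q - I|."""
    n = len(L_diag)
    LM = [[M[i][j] + (L_diag[i] if i == j else 0.0) for j in range(n)] for i in range(n)]
    P = matmul(LM, truncated_inverse(L_diag, M, q))
    return max(abs(P[i][j] - (1.0 if i == j else 0.0)) for i in range(n) for j in range(n))


# Hypothetical example in which L^{-1} M has norm below one, so the series converges.
L_DIAG = [2.0, 3.0, 4.0]
M_SMALL = [[0.5, 0.1, 0.0], [0.1, 0.4, 0.1], [0.0, 0.1, 0.3]]

if __name__ == "__main__":
    for q in (2, 10, 30):
        print(q, residual(L_DIAG, M_SMALL, q))
```

The residual shrinks geometrically with \(q\) when \(L^{-1}M\) is small enough, which is the role played by the condition \(\lambda _{p^*} < \tau \) in the text.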
First, we deal with the term \( k(x)^T L^{-1} k(x) \) with the following lemma proved in the Appendix.
Lemma 1
Lemma 2
We note that the convergences presented in Lemma 2 hold in probability. Then, we have the following lemma proved in the Appendix:
Lemma 3
5 Examples of rates of convergence for the learning curve
5.1 Numerical study on the assumptions of Theorem 2
We observe in Fig. 1 that the convergence is effective for \(n < 100\) for all values of \(\tau \). The convergence is robust for small values of \(\tau \): the asymptotic value (8) is a good approximation of the IMSE if \(n \ge 5\) for \(\tau = 10^{-1}\), if \(n \ge 20\) for \(\tau = 10^{-2}\) and if \(n \ge 60\) for \(\tau = 10^{-3}\). This corresponds approximately to the threshold values \(n \tau = 0.5\) for \( \text {IMSE}_\infty = 0.1575\), \(n \tau = 0.2\) for \( \text {IMSE}_\infty = 0.05\) and \(n \tau = 0.06\) for \( \text {IMSE}_\infty = 0.0158\); or globally to \(n \tau \approx 4\, \text {IMSE}_\infty \).
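The quoted values of \(\text {IMSE}_\infty \) can be reproduced with a short Python sketch under two assumptions that are not restated in this section: that the limit of Theorem 2 has the form \(\sum _p \tau \lambda _p/(\tau + \lambda _p)\) (the form of Eq. (7.26) in Rasmussen and Williams 2006 with \(\tau = \sigma ^2/n\)), and that the kernel is the Brownian one, \(k(x,y)=\min (x,y)\), on \([0,1]\) with a uniform design measure, whose eigenvalues are \(\lambda _p = 1/((p-1/2)^2\pi ^2)\). The function names are illustrative.

```python
import math


def brownian_eigenvalue(p):
    """p-th eigenvalue (p = 1, 2, ...) of min(x, y) on [0, 1] with uniform measure."""
    return 1.0 / ((p - 0.5) ** 2 * math.pi ** 2)


def imse_limit_series(tau, n_terms=200_000):
    """Truncated sum of tau * lambda_p / (tau + lambda_p) (assumed form of the limit)."""
    return sum(tau * lam / (tau + lam)
               for lam in map(brownian_eigenvalue, range(1, n_terms + 1)))


def imse_limit_closed_form(tau):
    """Closed form of the same series for the Brownian kernel: (sqrt(tau)/2) * tanh(1/sqrt(tau))."""
    r = math.sqrt(tau)
    return 0.5 * r * math.tanh(1.0 / r)


if __name__ == "__main__":
    for tau in (1e-1, 1e-2, 1e-3):
        print(tau, round(imse_limit_series(tau), 4), round(imse_limit_closed_form(tau), 4))
```

Under these assumptions the sketch gives approximately 0.1575, 0.05 and 0.0158 for \(\tau = 10^{-1}, 10^{-2}, 10^{-3}\), in agreement with the values quoted above, and the closed form shows the \(\sqrt{\tau }/2\) behavior for small \(\tau \).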
This highlights the relevance of the asymptotic value of the IMSE given in Theorem 2. However, in general we do not have an explicit expression for the eigenvalues of a covariance kernel. In this case, we can obtain the asymptotic expression of \(\text {IMSE}_\infty \) for small \(\tau \) from the asymptotic behavior of the eigenvalues \((\lambda _p)_{p \ge 0}\) for large \(p\). We deal with this issue in the next subsection.
5.2 Rate of convergence for some usual kernels
Theorem 2 gives the asymptotic value of the generalization error as a function of the eigenvalues of the covariance kernel. However, this asymptotic value is hard to handle since the expression of the eigenvalues is rarely known. To deal with this problem, we introduce in Proposition 1 a quantity \(B_\tau \) which has the same rate of convergence as the asymptotic value of the generalization error and which is tractable for our purpose.
Proposition 1
Proof
Proposition 1 shows that the rate of convergence of the generalization error \(\text {IMSE}_\infty \) as a function of \(\tau \) is equivalent to the one of \(B_\tau \). In this section, we analyze the rate of convergence of \(\text {IMSE}_\infty \) (or equivalently \(B_\tau \)) when \(\tau \) is small.
In this section, we consider that the design measure \(\mu \) is uniform on \([0,1]^d\).
Example
Therefore, the IMSE decreases with \(\tau \). We find here a classical result about Monte-Carlo convergence which gives that the variance decay is proportional to the observation noise variance (\(n\tau \)) divided by the number of observations (\(n\)) for any dimension. Nevertheless, for non-degenerate kernels, the number of non-zero eigenvalues is infinite and we are hence in an infinite-dimensional case (contrary to the degenerate one). We see in the following examples that we do not conserve the usual Monte-Carlo convergence rate in this case, which emphasizes the importance of Theorem 1 dealing with non-degenerate kernels.
Example
The associated Gaussian process, called fractional Brownian motion, is Hölder continuous with exponent \(H-\varepsilon \), \(\forall \varepsilon > 0\). According to (Bronski 2003), we have the following result:
Lemma 4
The rate of convergence for a fractional Brownian motion with Hurst parameter \(H\) is \(\tau ^{1-\frac{1}{2H+1}}\). We note that the case \(H = 1/2\) corresponds to the classical Brownian motion. We observe that the larger the Hurst parameter is (i.e. the more regular the Gaussian process is), the faster the convergence is. Furthermore, for \(H \rightarrow 1\) the convergence rate gets close to \(\tau ^{2/3}\). Therefore, even for the most regular fractional Brownian motion, we are still far from the classical Monte-Carlo convergence rate.
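The exponent \(1-\frac{1}{2H+1}\) can be checked numerically with the following Python sketch. It assumes an idealized pure power-law spectrum \(\lambda _p = p^{-(2H+1)}\) (the constants and lower-order terms of the actual eigenvalues are ignored) and the summand \(\tau \lambda _p/(\tau +\lambda _p)\) as the assumed form of the limit; the log-log slope between two small values of \(\tau \) is then compared with the theoretical exponent. All names and default values are illustrative.

```python
import math


def spectral_sum(tau, decay, n_terms=100_000):
    """Truncated sum of tau * lam / (tau + lam) with lam = p ** (-decay)."""
    total = 0.0
    for p in range(1, n_terms + 1):
        lam = p ** (-decay)
        total += tau * lam / (tau + lam)
    return total


def empirical_rate(hurst, tau_hi=1e-4, tau_lo=1e-5):
    """Log-log slope of the spectral sum between two small values of tau."""
    decay = 2.0 * hurst + 1.0  # assumed eigenvalue decay exponent
    ratio = spectral_sum(tau_hi, decay) / spectral_sum(tau_lo, decay)
    return math.log(ratio) / math.log(tau_hi / tau_lo)


def theoretical_rate(hurst):
    """Exponent 1 - 1/(2H + 1) stated in the text."""
    return 1.0 - 1.0 / (2.0 * hurst + 1.0)


if __name__ == "__main__":
    for h in (0.5, 0.75, 0.9):
        print(h, empirical_rate(h), theoretical_rate(h))
```

The truncation at a finite number of terms is only adequate when the eigenvalues decay fast enough; for small Hurst parameters the tail of the series is no longer negligible and more terms would be needed.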
Example
This result is in agreement with the one of Ritter (2000a), who proved that for one-dimensional kernels satisfying the Sacks–Ylvisaker conditions of order \(r\) (where \(r\) is an integer), the generalization error for the best linear estimator and experimental design set strategy decays as \(\tau ^{1-\frac{1}{2r+2}}\). Indeed, for such kernels, the eigenvalues satisfy the large-\(p\) behavior \(\lambda _p \propto 1/p^{2r+2}\) (Rasmussen and Williams 2006) and by following the guideline of the previous examples we find the same convergence rate. We note that the Matérn kernel with parameter \(\nu = r + 1/2\) satisfies the Sacks–Ylvisaker conditions of order \(r\).
Example
Example
Remark
We can see from the previous examples that for smooth kernels, the convergence rate is close to \(\tau \), i.e. the classical Monte-Carlo rate.
5.3 Numerical examples
We compare the previous theoretical results on the rate of convergence of the generalization error with full numerical simulations. In order to observe the asymptotic convergence, we fix \(n = 200\) and we consider \(1/\tau \) varying from \(50\) to \(1000\). The experimental design sets are sampled from a uniform measure on \([0,1]\) and the observation noise is \(n\tau \). To estimate the IMSE (5) we use a trapezoidal numerical integration with 4000 quadrature points over \([0,1]\). Furthermore, to build the convergence curves in Figs. 2 and 3 we use a linear regression with the first value of the IMSE, an intercept fixed to zero (since the IMSE tends to 0 when \(\tau \) tends to 0) and a unique explanatory variable corresponding to the tested convergence (e.g. \(\tau ^{0.1}, \tau \log (1 / \tau )\), ...).
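One possible reading of the fitting step is an ordinary least-squares regression through the origin on a single candidate rate. The Python sketch below illustrates this reading only; it is not the exact procedure used for the figures. The IMSE values are synthetic, generated from the closed form \((\sqrt{\tau }/2)\tanh (1/\sqrt{\tau })\) of the Brownian kernel (an assumption made for illustration), and two candidate rates, \(\sqrt{\tau }\) and \(\tau \), are compared.

```python
import math


def slope_through_origin(x, y):
    """Least-squares slope c of the model y = c * x (intercept fixed to zero)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def relative_fit_error(x, y):
    """Root mean squared relative error of the zero-intercept fit."""
    c = slope_through_origin(x, y)
    return math.sqrt(sum(((c * a - b) / b) ** 2 for a, b in zip(x, y)) / len(y))


# Synthetic data: 1/tau from 50 to 1000, IMSE from the Brownian closed form (assumption).
TAUS = [1.0 / k for k in range(50, 1001, 50)]
IMSE = [0.5 * math.sqrt(t) * math.tanh(1.0 / math.sqrt(t)) for t in TAUS]

CANDIDATES = {
    "sqrt(tau)": [math.sqrt(t) for t in TAUS],
    "tau": list(TAUS),
}

if __name__ == "__main__":
    for name, x in CANDIDATES.items():
        print(name, slope_through_origin(x, IMSE), relative_fit_error(x, IMSE))
```

On these synthetic values the \(\sqrt{\tau }\) candidate is fitted with a slope close to \(1/2\) and a negligible error, while the Monte-Carlo candidate \(\tau \) leaves a large relative error.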
We see in Fig. 2 that the observed rate of convergence is well fitted by the theoretical one. We note that we are far from the classical Monte-Carlo rate since we are not in a degenerate case.
6 Applications of the learning curve
Let us consider that we want to approximate the function \(x\in \mathbb {R}^d \rightarrow f(x) \) from noisy observations at fixed points \((x_i)_{i=1,\ldots ,n}\), with \(n \gg 1\), sampled from the design measure \(\mu \) and with \(s\) replications at each point \(x_i\). In Sect. 6.1 we present how to determine the needed budget \(T=ns\) to achieve a prescribed precision. Then, in Sect. 6.2, we illustrate this method on an industrial example.
6.1 Estimation of the budget required to reach a prescribed precision
Let us consider a prescribed generalization error denoted by \({\bar{\varepsilon }}\). The purpose of this subsection is to determine from an initial budget \(T_0\) the budget \(T\) for which the generalization error reaches the value \({\bar{\varepsilon }}\).
First, we build an initial experimental design set \((x_i^\text {train})_{i=1,\ldots ,n}\) sampled with respect to the design measure \(\mu \) and with \(s^*\) replications at each point such that \(T_0=ns^*\). From the \(s^*\) replications \((z_{i,j})_{j=1,\ldots ,s^*}\), we can estimate the observation noise variance \(\sigma _\varepsilon ^2\) with a classical empirical estimator: \( \bar{\sigma }^2_\varepsilon = \sum _{i=1}^n \sum _{j=1}^{s^*}(z_{i,j}-z_i^n)^2/(n(s^*-1)),\, z_i^n = \sum _{j=1}^{s^*}z_{i,j}/s^*\).
Second, we use the observations \(z_i^n = (\sum _{j=1}^{s^*}z_{i,j})/s^*\) to estimate the covariance kernel \(k(x,x')\). In practice, we consider a parametrized family of covariance kernels and we select the parameters which maximize the likelihood (Stein 1999).
Third, from Theorem 2 we can get the expression of the generalization error decay with respect to \(T\) (denoted by \(\hbox {IMSE}_T\)). Therefore, we just have to determine the budget \(T\) such that \(\text {IMSE}_T = \varvec{\bar{\varepsilon }}\). In practice, we will not use Theorem 2 but the asymptotic results described in Sect. 5.2.
This strategy is applied to an industrial case in Sect. 6.2. We note that in the application presented in Sect. 6.2, we have \(s^*=1\). In fact, in this example the observations are themselves obtained by an empirical mean of a Monte-Carlo sample and thus the noise variance can be estimated without processing replications.
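The first and third steps of this subsection can be sketched in Python as follows. The pooled variance estimator follows the empirical formula of the first step. The budget formula assumes a pure power-law decay \(\text {IMSE}_T \propto T^{-\alpha }\) with a known exponent \(\alpha \); this is a simplification that ignores possible logarithmic factors, and the function names and numerical values are hypothetical.

```python
def pooled_noise_variance(replicates):
    """Estimate sigma_eps^2 from replicates[i][j] = z_{i,j}, with s* >= 2 values per point."""
    n = len(replicates)
    s = len(replicates[0])
    if s < 2:
        raise ValueError("at least two replications per point are required")
    total = 0.0
    for row in replicates:
        mean = sum(row) / s  # z_i^n, the averaged observation at point i
        total += sum((z - mean) ** 2 for z in row)
    return total / (n * (s - 1))


def required_budget(initial_budget, imse_initial, target, alpha):
    """Budget T such that imse_initial * (T0 / T) ** alpha equals the target.

    Assumes IMSE_T decays as T ** (-alpha); alpha is a hypothetical rate exponent
    (alpha = 1 would be the classical Monte-Carlo rate).
    """
    return initial_budget * (imse_initial / target) ** (1.0 / alpha)


if __name__ == "__main__":
    # Hypothetical data: two design points with two replications each.
    print(pooled_noise_variance([[1.0, 3.0], [2.0, 2.0]]))
    # Hypothetical target: divide the IMSE by four starting from T0 = 100.
    for alpha in (1.0, 0.5):
        print(alpha, required_budget(100, 1.0e-3, 2.5e-4, alpha))
```

With these hypothetical numbers, dividing the error by four requires four times the initial budget under the Monte-Carlo rate but sixteen times under a rate exponent of \(1/2\), which shows how sensitive the predicted budget is to the convergence rate.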
6.2 Industrial case: MORET code
We illustrate in this section an industrial application of our results about the rate of convergence of the IMSE.
6.2.1 Data presentation
We use in this section the notation presented in Sect. 2.2. The outputs of the MORET code at point \(x_i\) are denoted by \(Y_j(x_i)\) where \(j=1,\ldots ,s_i\) and \(i=1,\ldots ,n\).
A large database \((Y_j(x_i))_{ i=1,\ldots ,5625, j=1,\ldots ,200}\) is available to us. We divide it into a training set and a test set. The 5625 points \(x_i\) of the database come from a \(75 \times 75\) grid over \([0,1]^2\). The training set consists of \(n= 100\) points \((x_i^{\text {train}})_{i=1,\ldots ,n}\) extracted from the complete database using a Latin Hypercube Sample (Fang et al. 2006) optimized with respect to the maximin criterion, together with the first observations \((Y_1(x_i^{\text {train}}))_{i=1,\ldots ,100}\). We note that the maximin criterion aims to maximize the minimal distance (with respect to the \(L_2\)-norm) between the points of the design. We will use the other 5525 points as a test set.
The aim of the study is—given the training set—to predict the budget needed to achieve a prescribed precision for the surrogate model.
Furthermore, the observation noise variance \(\sigma ^2_{\varepsilon }\) is estimated by \( \bar{\sigma }^2_\varepsilon = 3.3\times 10^{-3}\) (see Sect. 6.1).
6.2.2 Model selection
Since the convergence rate is strongly dependent on the regularity parameter \(\nu \), we have to estimate this hyperparameter well in order to evaluate the model error decay accurately. Note that there is no closed-form expression for the estimator of \(\sigma ^2\); it hence has to be estimated jointly with \(\theta \) and \(\nu \).
Let us consider the vector of parameters \(\phi = (\nu , \theta _1, \theta _2, \sigma ^2)\). In order to perform the maximization, we have first randomly generated a set of 10,000 parameters \((\phi _k)_{k=1,\ldots ,10^4}\) on the domain \([0.5,3] \times [0.01,2] \times [0.01,2]\times [0.01,1]\). We have then selected the 150 best parameters (i.e. the ones maximizing the concentrated Maximum Likelihood) and we have started a quasi-Newton based maximization from these parameters. More specifically, we have used the BFGS method (Shanno 1970). Finally, from the results of the 150 maximization procedures, we have selected the best parameter. We note that the quasi-Newton based maximizations have all converged to two parameter values, around 30 % to the actual maximum and 70 % to another local maximum.
The estimated hyperparameters are \(\nu = 1.31\), \(\theta _1 = 0.67, \theta _2 = 0.45\) and \(\sigma ^2= 0.24 \). This means that we have a rough surrogate model which is not differentiable and is \(\alpha \)-Hölder continuous with exponent \(\alpha = 0.81\). The variance of the observations is \(\bar{\sigma }^2_\varepsilon =3.3\times 10^{-3}\). Using the same notation as in Sect. 2.3, we have \(\tau = \bar{\sigma }^2_\varepsilon /T_0 \) with \(T_0 = n\) (it corresponds to \(s=1\)).
The IMSE of the Gaussian process regression is \(\text {IMSE}_{T_0}=1.0\times 10^{-3}\) and its empirical mean squared error is \(\text {EMSE}_{T_0} = 1.2\times 10^{-3}\). To compute the empirical mean squared error (EMSE), we use the observations \((Y_j(x_i))_{i=1,\ldots ,5525,\, j=1 \ldots , 200}\) with \(x_i \ne x_k^{\text {train}}\) \(\forall k=1,\ldots ,100, i=1,\ldots , 5525\), and to compute the IMSE (5) (which depends only on the positions of the training set and on the selected hyperparameters) we use a trapezoidal numerical integration on a \(75 \times 75\) grid over \([0,1]^2\). For \(s=200\), the observation variance of the output \(k_{\text {eff},s}(x)\) equals \({ \bar{\sigma }^2_\varepsilon }/{200}=1.64\times 10^{-5}\) and is neglected for the estimation of the empirical error. We can see that the IMSE is close to the empirical MSE, which means that our model describes the observations accurately.
6.2.3 Convergence of the IMSE
We see empirically that the EMSE of \(\varvec{\bar{\varepsilon }} = 2.0\times 10^{-4}\) is achieved for \(s = 31\). This shows that the predicted IMSE and the empirical MSE are close and that the selected kernel captures the regularity of the response accurately.
Let us consider the classical Monte-Carlo convergence rate \( \bar{\sigma }_\varepsilon ^2/T\), which corresponds to the convergence rate of degenerate kernels, i.e. in the finite-dimensional case. Figure 5 compares the theoretical rate of convergence of the IMSE with the classical Monte-Carlo one. We see that the Monte-Carlo decay is too fast and does not represent correctly the empirical MSE decay. If we had considered the rate of convergence \(\text {IMSE} \sim \bar{\sigma }_\varepsilon ^2/T\), we would have reached an IMSE of \(\varvec{\bar{\varepsilon }} = 2.0\times 10^{-4}\) for \(s = 6\) (which is very far from the observed value \( s = 31\)).
7 Conclusion
The main result of this paper is the proof of a theorem giving the Gaussian process regression MSE when the number of observations is large and the observation noise variance is proportional to the number of observations. The proof generalizes previous ones which prove this result in dimension one or two or for a restricted class of covariance kernels (for degenerate ones).
A first limitation of the presented results is that the noise variance generally does not depend on the number of observations. The additive dependence of the noise variance on the number of observations is a technical assumption which allows for controlling the convergence of the learning curve. However, it is natural in the framework of experiments with replications or Monte-Carlo simulators. Deriving the presented results for the case of constant noise is still an open problem and is of great practical interest.
The asymptotic value of the MSE is derived in terms of the eigenvalues and eigenfunctions of the covariance function and holds for degenerate and non-degenerate kernels and for any dimension. From this theorem, we can deduce the asymptotic behavior of the generalization error (defined in this paper as the IMSE) as a function of the reduced observation noise variance (it corresponds to the noise variance when the number of observations equals one). A strength of this theorem is that the rate of convergence of the generalization error can be deduced from that of the eigenvalues, which is known for usual covariance kernels. The relevance of this rate of convergence is emphasized in a numerical study for different kernels. However, this leads to another limitation since the presented results cannot be used for general covariance kernels for which the eigenvalue decay rate is unknown.
The significant differences between the rates of convergence of degenerate and non-degenerate kernels highlight the importance of proving this result for non-degenerate kernels. This is especially important as usual kernels for Gaussian process regression are non-degenerate.
Finally, for practical perspectives, the presented method allows for evaluating the computational budget required to reach a given accuracy. It has been successfully applied to a real-world problem about the safety assessment of a nuclear system. However, it is efficient only for specific applications (e.g. stochastic simulators with a constant observation noise variance) and when the computational budget is large. More investigations have to be performed to deal with the cases of heterogeneous noise, noise-free simulators or very limited computational budgets.
Footnotes
 1.
If \(B\) is a nonsingular \(p\times p\) matrix, \(C\) a nonsingular \(m \times m\) matrix and \(A\) a \(p \times m\) matrix with \(m,p < \infty \), then \((B+AC^{-1}A^T)^{-1} = B^{-1} - B^{-1}A(A^TB^{-1}A+C)^{-1}A^TB^{-1}\).
Notes
Acknowledgments
The authors are grateful to Dr. Yann Richet of the IRSN—Institute for Radiological Protection and Nuclear Safety—for providing the data for the industrial case through the reDICE project.
References
 Abramowitz, M., & Stegun, I. A. (1965). Handbook of mathematical functions. New York: Dover.
 Berger, J. O., De Oliveira, V., & Sansó, B. (2001). Objective Bayesian analysis of spatially correlated data. Journal of the American Statistical Association, 96, 1361–1374.
 Bozzini, M., & Rossini, M. (2003). Numerical differentiation of 2D functions from noisy data. Computers and Mathematics with Applications, 45, 309–327.
 Bronski, J. C. (2003). Asymptotics of Karhunen–Loève eigenvalues and tight constants for probability distributions of passive scalar transport. Communications in Mathematical Physics, 238, 563–582.
 Fang, K. T., Li, R., & Sudjianto, A. (2006). Design and modeling for computer experiments. Computer science and data analysis series. London: Chapman & Hall.
 Fernex, F., Heulers, L., Jacquet, O., Miss, J., & Richet, Y. (2005). The Moret 4b Monte Carlo code: New features to treat complex criticality systems. In M&C International Conference on Mathematics and Computation, Supercomputing, Reactor and Nuclear and Biological Application, Avignon, France.
 Gneiting, T., Kleiber, W., & Schlather, M. (2010). Matérn cross-covariance functions for multivariate random fields. Journal of the American Statistical Association, 105, 1167–1177.
 Harville, D. A. (1997). Matrix algebra from a statistician’s perspective. New York: Springer-Verlag.
 Laslett, G. M. (1994). Kriging and splines: An empirical comparison of their predictive performance in some applications. Journal of the American Statistical Association, 89, 391–400.
 Mercer, J. (1909). Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society A, 209, 441–458.
 Nazarov, A. I., & Nikitin, Y. Y. (2004). Exact \(L_2\)-small ball behaviour of integrated Gaussian processes and spectral asymptotics of boundary value problems. Probability Theory and Related Fields, 129, 469–494.
 Opper, M., & Vivarelli, F. (1999). General bounds on Bayes errors for regression with Gaussian processes. Advances in Neural Information Processing Systems, 11, 302–308.
 Picheny, V. (2009). Improving accuracy and compensating for uncertainty in surrogate modeling. PhD thesis, Ecole Nationale Supérieure des Mines de Saint Etienne.
 Pusev, R. S. (2011). Small deviation asymptotics for Matérn processes and fields under weighted quadratic norm. Theory of Probability and its Applications, 55, 164–172.
 Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. Cambridge: MIT Press.
 Ritter, K. (2000a). Almost optimal differentiation using noisy data. Journal of Approximation Theory, 86, 293–309.
 Ritter, K. (2000b). Average-case analysis of numerical problems. Berlin: Springer-Verlag.
 Sacks, J., & Ylvisaker, D. (1981). Variance estimation for approximately linear models. Series Statistics, 12, 147–162.
 Sacks, J., Welch, W. J., Mitchell, T. J., & Wynn, H. P. (1989). Design and analysis of computer experiments. Statistical Science, 4, 409–423.
 Seeger, M. W., Kakade, S. M., & Foster, D. P. (2008). Information consistency of nonparametric Gaussian process methods. IEEE Transactions on Information Theory, 54(5), 2376–2382.
 Shanno, D. F. (1970). Conditioning of quasi-Newton methods for function minimization. Mathematics of Computation, 24, 647–656.
 Sollich, P., & Halees, A. (2002). Learning curves for Gaussian process regression: Approximations and bounds. Neural Computation, 14, 1393–1428.
 Stein, M. L. (1999). Interpolation of spatial data. Series in statistics. New York: Springer.
 van der Vaart, A., & van Zanten, H. (2011). Information rates of nonparametric Gaussian process methods. The Journal of Machine Learning Research, 12, 2095–2119.
 Wackernagel, H. (2003). Multivariate geostatistics. Berlin: Springer-Verlag.
 Williams, C. K. I., & Vivarelli, F. (2000). Upper and lower bounds on the learning curve for Gaussian processes. Machine Learning, 40, 77–102.