Machine Learning, Volume 98, Issue 3, pp 407–433

Asymptotic analysis of the learning curve for Gaussian process regression

  • Loic Le Gratiet
  • Josselin Garnier

Abstract

This paper deals with the learning curve in a Gaussian process regression framework. The learning curve describes the generalization error of the Gaussian process used for the regression. The main result is the proof of a theorem giving the generalization error for a large class of correlation kernels and for any dimension when the number of observations is large. From this theorem, we can deduce the asymptotic behavior of the generalization error when the observation error is small. The presented proof generalizes previous ones that were limited to special kernels or to small dimensions (one or two). The theoretical results are applied to a nuclear safety problem.

Keywords

Gaussian process regression · Asymptotic mean squared error · Learning curves · Generalization error · Convergence rate

1 Introduction

Gaussian process regression is a useful tool to approximate an objective function given some of its observations (Laslett 1994). It was originally used in geostatistics to interpolate a random field at unobserved locations (Wackernagel 2003; Berger et al. 2001; Gneiting et al. 2010), and it has since been developed in many areas such as environmental and atmospheric sciences.

This method has become very popular during the last decades to build surrogate models from noise-free observations. For example, it is widely used in the field of “computer experiments” to build surrogate models of expensive computer codes (Sacks et al. 1989). Then, through the fast approximation of the computer code, uncertainty quantification and sensitivity analysis can be performed at a low computational cost.

Nonetheless, for many realistic cases, we do not have direct access to the function to be approximated but only to noisy versions of it. For example, if the objective function is the result of an experiment, the available responses can be corrupted by measurement noise. In that case, we can reduce the noise of the observations by repeating the experiments at the same locations. Another example is Monte-Carlo based simulators—also called stochastic simulators—which use Monte-Carlo or Markov Chain Monte-Carlo methods to solve a system of differential equations through its probabilistic interpretation. For such simulators, the noise level can be tuned by the number of Monte-Carlo particles used in the procedure.

In this paper, we are interested in obtaining learning curves describing the generalization error—defined as the averaged mean squared error—of the Gaussian process regression as a function of the training set size (Rasmussen and Williams 2006). The problem has been addressed in both the statistical and the numerical analysis literature. For an overview, the reader is referred to (Ritter 2000b) for a numerical analysis point of view and to (Rasmussen and Williams 2006) for a statistical one. In particular, in the numerical analysis literature, the authors are interested in numerical differentiation of functions from noisy data (Ritter 2000a; Bozzini and Rossini 2003). They have obtained precise results for kernels satisfying the Sacks–Ylvisaker conditions of order \(r\) (Sacks and Ylvisaker 1981), but these results are only valid for 1-D or 2-D functions.

In the statistical literature, Sollich and Halees (2002) give accurate approximations to the learning curve, and Opper and Vivarelli (1999) and Williams and Vivarelli (2000) give upper and lower bounds on it. Their approximations give the asymptotic value of the learning curve (for a very large number of observations). They are based on the Woodbury–Sherman–Morrison matrix inversion lemma (Harville 1997), which holds in finite-dimensional cases corresponding to degenerate covariance kernels in our context. Nonetheless, classical kernels used in Gaussian process regression are non-degenerate; hence we are in an infinite-dimensional case and the Woodbury–Sherman–Morrison formula cannot be used directly.

To deal with asymptotics of Gaussian process learning curves for more general kernels, some authors have used other definitions of the generalization error. For example, Seeger et al. (2008) present consistency results and convergence rates for the cumulative log loss of Bayesian prediction. Their work was later revisited by van der Vaart and van Zanten (2011), who suggest and study another risk which is an upper bound for the one presented in (Seeger et al. 2008).

The main result of this paper is the proof of a theorem giving the value of the Gaussian process regression mean squared error (MSE) for a large training set size when the observation noise variance is proportional to the number of observations. This value is given as a function of the eigenvalues and eigenfunctions of the covariance kernel. From this theorem, we can deduce an approximation of the learning curve for non-degenerate and degenerate kernels [which generalizes the proofs given in (Opper and Vivarelli 1999; Sollich and Halees 2002; Picheny 2009)] and for any dimension [which generalizes the proofs given in (Ritter 2000a, b; Bozzini and Rossini 2003)].

The rate of convergence of the best linear unbiased predictor (BLUP) is of practical interest since it provides a powerful tool for decision support. Indeed, from an initial experimental design set, it can predict the additional computational budget (defined as the number of experiments including repetitions) necessary to reach a given desired accuracy.

The paper is organized as follows. First we present the asymptotic framework considered in this paper in Sect. 2. Although the main results of the paper are theoretical contributions, an application is provided in order to emphasize the possible implications for real-world problems. Second, we present in Sect. 3 the main result of the paper, which is the theorem giving the MSE of the considered model for a large training size. This theorem is proved in Sect. 4. Third, we study the rate of convergence of the generalization error when the noise variance decreases in Sect. 5. The theoretical asymptotic rates of convergence are compared to those obtained in numerical simulations and academic examples. Furthermore, a study of how large the training set size should be for the asymptotic formulas to agree with the numerical ones is provided for the specific case of the Brownian motion. Finally, an industrial application to the safety assessment of a nuclear system containing fissile materials is considered in Sect. 6. This real case emphasizes the effectiveness of the theoretical rate of convergence of the BLUP since it predicts a very good approximation of the budget needed to reach a prescribed precision.

2 Generalization error for noisy observations

The general framework of the paper is given in this section. First, the mathematical formalism on which the theoretical developments are based is presented. Then, the considered application is introduced. Finally, the bridge between the theoretical developments and the application is given.

2.1 Asymptotic framework for the analysis of the generalization error

Let us suppose that we want to approximate an objective function \(x \in \mathbb {R}^d \rightarrow f(x) \in \mathbb {R}\) from noisy observations of it at points \((x_i)_{i=1,\ldots ,n}\) with \(x_i \in \mathbb {R}^d\). The points of the experimental design set \((x_i)_{i=1,\ldots ,n}\) are supposed to be sampled from the probability measure \(\mu \) over \(\mathbb {R}^d\). \(\mu \) is called the design measure; it can have either compact support (for a bounded input parameter space) or unbounded support (for an unbounded input parameter space). We hence have \(n\) observations of the form \(z_i = f(x_i) + \varepsilon _i\) and we consider that \((\varepsilon _i)_{i=1,\ldots ,n}\) are independently sampled from the Gaussian distribution with mean zero and variance \(n\tau \):
$$\begin{aligned} \varepsilon \sim \mathcal {N}(0,n \tau ). \end{aligned}$$
(1)
with \(\tau \) a positive constant. Note that the number of observations and the observation noise variance are both controlled by \(n\). An observation noise variance proportional to the number of observations is one of the main assumptions of this article. Intuitively, it allows for controlling the convergence of the generalization error when \(n\) tends to infinity by increasing the regularization. However, this is also the main limitation of the paper since the noise variance is generally independent of the number of observations. As presented in Sect. 2.3, this assumption is suitable for the particular cases of stochastic simulators or experiments with repetitions when the total number of simulations or experiments is fixed. The issue of the convergence of the generalization error for more general cases is still an open problem.
The main idea of Gaussian process regression is to suppose that the objective function \(f(x)\) is a realization of a Gaussian process \(Z(x)\) with a known mean and a known covariance kernel \(k(x,x')\). The mean can be considered equal to zero without loss of generality. Then, denoting by \(z^{n} = [f(x_i)+ \varepsilon _i]_{1\le i \le n}\) the vector of length \(n\) containing the noisy observations, we choose as predictor the BLUP given by the equation:
$$\begin{aligned} \hat{f}(x) = k(x)^T (K+n\tau I)^{-1}z^{n}, \end{aligned}$$
(2)
where \(k(x) = [k(x,x_i)]_{1\le i \le n}\) is the \(n\)-vector containing the covariances between \(Z(x)\) and \(Z(x_i)\), \(1\le i \le n\), \(K = [k(x_i,x_j)]_{1\le i,j \le n}\) is the \(n \times n\) matrix containing the covariances between \(Z(x_i)\) and \(Z(x_j)\), \(1\le i ,j \le n\), and \(I\) is the \(n \times n\) identity matrix. We note here that the unbiasedness means that \(\mathbb {E}[\hat{f} (x)] = \mathbb {E}[Z(x)] \), where \(\mathbb {E}\) stands for the expectation with respect to the distribution of the Gaussian process \(Z(x)\) and the noise \(\varepsilon \). The BLUP minimizes the MSE, which equals:
$$\begin{aligned} \sigma ^2(x) = k(x,x) - k(x)^T (K+n\tau I)^{-1}k(x). \end{aligned}$$
(3)
Indeed, if we consider a linear unbiased predictor (LUP) of the form \(a(x)^T z^{n}\), its MSE is given by:
$$\begin{aligned} \mathbb {E}\left[ (Z(x) - a^T(x) Z^{n})^2\right] = k(x,x) - 2a(x)^Tk(x) + a(x)^T (K+n\tau I) a(x), \end{aligned}$$
(4)
where \(Z^{n} = [Z(x_i)+\varepsilon _i]_{1 \le i \le n}\). The value of \(a(x)\) minimizing (4) is \(a_{\text {opt}}(x)^T=k(x)^T (K+n\tau I)^{-1}\). Therefore, the BLUP given by \(a_{\text {opt}}(x)^T z^{n}\) is equal to (2) and by substituting \(a(x)\) with \(a_{\text {opt}}(x)\) in Eq. (4) we obtain the MSE of the BLUP given by Eq. (3).
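The predictor (2) and its MSE (3) are straightforward to compute. Below is a minimal Python/NumPy sketch (the language, the Brownian motion kernel \(k(x,x')=\min (x,x')\) and all numerical values are illustrative choices, not part of the paper):

```python
import numpy as np

def blup(x, X, z, kernel, tau):
    """BLUP (Eq. 2) and its MSE (Eq. 3) from n noisy observations z at
    design points X, with observation noise variance n*tau."""
    n = len(X)
    K = np.array([[kernel(a, b) for b in X] for a in X])  # covariance matrix K
    kx = np.array([kernel(x, a) for a in X])              # covariance vector k(x)
    w = np.linalg.solve(K + n * tau * np.eye(n), kx)      # (K + n*tau*I)^{-1} k(x)
    return w @ z, kernel(x, x) - kx @ w                   # prediction (2), MSE (3)

# Illustration with the Brownian motion kernel k(x, x') = min(x, x') on [0, 1]
rng = np.random.default_rng(0)
n, tau = 50, 0.01
X = rng.uniform(0, 1, n)
z = np.sin(3 * X) + rng.normal(0, np.sqrt(n * tau), n)    # noisy observations
f_hat, mse = blup(0.5, X, z, lambda u, v: min(u, v), tau)
```

By construction the returned MSE lies between 0 and the prior variance \(k(x,x)\).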
The main focus of this paper is the asymptotic value of \(\sigma ^2(x)\) when \(n \rightarrow +\infty \). From it, we can deduce the asymptotic value of the integrated mean squared error (IMSE)—also called learning curve or generalization error—when \(n \rightarrow +\infty \). The IMSE is defined by:
$$\begin{aligned} \text {IMSE} = \int \limits _{\mathbb {R}^d} \sigma ^2(x) \, d\mu (x), \end{aligned}$$
(5)
where \(\mu \) is the design measure of the input space parameters.
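In practice the integral (5) can be approximated by Monte-Carlo over the design measure. A sketch (uniform \(\mu \) on \([0,1]\) and the Brownian motion kernel are illustrative assumptions):

```python
import numpy as np

# Monte-Carlo approximation of the IMSE (Eq. 5): average sigma^2(x) over test
# points drawn from the design measure mu (uniform on [0, 1] here), using the
# Brownian motion kernel k(x, x') = min(x, x').
rng = np.random.default_rng(1)
n, tau = 100, 0.05
X = rng.uniform(0, 1, n)                         # design points sampled from mu
A_inv = np.linalg.inv(np.minimum.outer(X, X) + n * tau * np.eye(n))

def sigma2(x):
    kx = np.minimum(x, X)                        # k(x) for the Brownian kernel
    return x - kx @ A_inv @ kx                   # Eq. (3); here k(x, x) = x

x_test = rng.uniform(0, 1, 2000)                 # test points sampled from mu
imse = np.mean([sigma2(x) for x in x_test])      # Monte-Carlo estimate of Eq. (5)
```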

The obtained asymptotic value has already been mentioned in several works (Rasmussen and Williams 2006; Ritter 2000b, a; Bozzini and Rossini 2003; Opper and Vivarelli 1999; Sollich and Halees 2002; Picheny 2009). The original contribution of this paper is a rigorous proof of this result.

2.2 Introduction to stochastic simulators

We present in this section the industrial application studied in Sect. 6.2. A stochastic simulator is a computer code which solves a system of partial differential equations with Monte-Carlo methods. Its particularity is that it provides noisy observations centered on the true solution of the system. Stochastic simulators are widely used in the field of nuclear physics to solve transport equations and model systems containing fissile materials (e.g. nuclear reactors, storage facilities for fissile materials, spacecraft reactors). In this paper, we are interested in a storage facility for dry plutonium(IV) oxide (\(\text {PuO}_2\)) used as fuel for nuclear reactors or spacecraft. As \(\text {PuO}_2\) is highly toxic, the safety assessment of such facilities is of great importance.

One of the most important factors used to assess the safety of a system containing fissile materials is the neutron multiplication factor, usually denoted by \(k_\text {eff}\). It is the average number of neutrons from one fission that cause another fission. This factor characterizes the criticality of a nuclear chain reaction:
  • \(k_{\text {eff}} > 1\) leads to an uncontrolled chain reaction due to an increasing neutron population.

  • \(k_{\text {eff}} = 1\) leads to a self-sustained chain reaction with a stable neutron population.

  • \(k_{\text {eff}} < 1\) leads to a faded chain reaction due to a decreasing neutron population.

The neutron multiplication factor is evaluated using the stochastic simulator called MORET (Fernex et al. 2005). It depends on many parameters. However, we only focus here on the following quantities:
  • \(d_{{\text {PuO}}_2} \in [0.5, 4] \text {g.cm}^{-3}\), the density of the fissile powder. It is scaled to \([0,1]\).

  • \(d_{{\text {water}}} \in [0,1] \text {g.cm}^{-3}\), the density of water between storage tubes.

We use the notation \(x = (d_{{\text {PuO}}_2}, d_{\text {water}})\) for the input parameters. Let us denote by \((Y_j(x))_{j=1,\ldots ,s}\) the output of the MORET code at point \(x\). \((Y_j(x))_{j=1,\ldots ,s}\) are realizations of independent and identically distributed random variables centered on \(k_{\text {eff}}(x)\). They are themselves obtained by an empirical mean of a Monte-Carlo sample of 4000 particles. From these particles, we can estimate the variance \(\sigma ^2_{\varepsilon }\) of the observation \(Y_j(x)\) by a classical empirical estimator.
Finally we can estimate \(k_{\text {eff}}(x)\) from the following quantity:
$$\begin{aligned} k_{\text {eff},s}(x) = \frac{1}{s} \sum _{j=1}^{s}Y_j(x). \end{aligned}$$
Therefore, the variance of an observation \(k_{{\text {eff}},s}(x)\) equals \(\sigma ^2_\varepsilon /s\).

2.3 Relation between the application and the considered mathematical formalism

Let us consider that we want to approximate the function \(x \in \mathbb {R}^d \rightarrow f(x) \in \mathbb {R}\) from noisy observations at points \((x_i)_{i=1,\ldots ,n}\) sampled from the design measure \(\mu \) and with \(s\) replications at each point. We hence have \(ns\) data of the form \(z_{i,j} = f(x_i) + \varepsilon _{i,j}\) and we consider that \((\varepsilon _{i,j})_{\begin{array}{c} i=1,\ldots ,n \\ j =1,\ldots ,s \end{array}}\) are independently sampled from a Gaussian distribution with mean zero and variance \(\sigma _\varepsilon ^2\). Then, denoting the vector of observed values by \(z^{n} = (z_i^n)_{i=1,\ldots ,n} = (\sum _{j=1}^s z_{i,j}/s)_{i=1,\ldots ,n}\), the variance of an observation \(z^{n}_i\) is \( \sigma _\varepsilon ^2/s\). We recognize here the output form given in Sect. 2.2. Thus, if we consider a fixed budget \(T=ns\), we have \( \sigma _\varepsilon ^2/s = n \tau \) with \(\tau = \sigma _\varepsilon ^2/T\), and the observation noise variance is proportional to \(n\) (as presented in Sect. 2.1). It means that if we increase the number \(n\) of observations, we automatically increase the uncertainty on the observations. An observation noise variance proportional to \(n\) is natural in the framework of experiments with repetitions or stochastic simulators. Indeed, for a fixed number of experiments (or simulations), the user can decide to perform them at a few points with many repetitions (in that case the noise variance will be low) or at many points with few repetitions (in that case the noise variance will be large).

We note that increasing \(n\) with a fixed \(\tau \) is an idealized asymptotic setting since it would require that the number of replications \(s\) tends to zero, while it has to be a positive integer. However, this issue can be handled in practice since for real applications \(n\) is finite and one just has to take a budget \(T\) such that \(T \ge n\) (i.e. \(s\ge 1\)). This is a first limitation of the suggested method since it cannot be used for small budgets (i.e. when \(T < n\)). A second one is the assumption that \(s\) does not depend on \(x_i\). Indeed, a uniform allocation may not be optimal, and finding the optimal sequence \(\{s_1,s_2,\ldots ,s_n\}\) leading to the minimal error is of practical interest. However, the corresponding observation noise variance will then depend on \(x_i\), which means that \(\tau _i\) will depend on \(x_i\) as well. In this case, the presented results do not hold. Nevertheless, they can be used to provide an upper bound for the convergence of the generalization error by considering the worst case \(\tau =\max _i \tau _i\).
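The fixed-budget trade-off described above can be made concrete with a few lines of arithmetic (the numerical values are illustrative):

```python
# Fixed-budget trade-off: T = n * s experiments in total.  Averaging the s
# replications at each of the n points yields observations with variance
# sigma_eps^2 / s = n * tau, where tau = sigma_eps^2 / T.
sigma_eps2 = 0.04             # variance of a single experiment (assumed value)
T = 1200                      # total budget (assumed value)
tau = sigma_eps2 / T
for n in (10, 60, 300):       # few points/many replications ... many points/few
    s = T // n                # replications per design point
    var_obs = sigma_eps2 / s  # variance of one averaged observation
    assert abs(var_obs - n * tau) < 1e-12   # grows linearly with n
```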

The objective of the industrial example presented in Sect. 6.2 is to determine the budget \(T\) required to reach a prescribed accuracy \({\bar{\varepsilon }}\). To deal with this issue, we first build a Gaussian process regression model from an initial budget \(T_0\) and a large number of observations \(n\). Then, from the results on the learning curve, we deduce the budget \(T\) such that the IMSE equals \({\bar{\varepsilon }}\).

3 Convergence of the learning curve for Gaussian process regression

This section deals with the convergence of the BLUP when the number of observations is large. The rate of convergence of the BLUP is evaluated through the generalization error—i.e. the IMSE—defined in (5). The main theorem of this paper follows:

Theorem 1

Let us consider \(Z(x)\) a Gaussian process with zero mean and covariance kernel \(k(x,x') \in \mathcal {C}^0(\mathbb {R}^d \times \mathbb {R}^d )\) and \((x_i)_{i=1,\ldots ,n}\) an experimental design set of \(n\) independent random points sampled with the probability measure \(\mu \) on \(\mathbb {R}^d\). We assume that \(\sup _{x \in \mathbb {R}^d} k(x,x) < \infty \). According to Mercer’s theorem (Mercer 1909), we have the following representation of \(k(x,x')\):
$$\begin{aligned} k(x,x') = \sum _{p \ge 0} \lambda _p \phi _p(x) \phi _p(x'), \end{aligned}$$
(6)
where \((\phi _p(x) )_p\) is an orthonormal basis of \(L^2_\mu (\mathbb {R}^d)\) (denoting the set of square integrable functions) consisting of eigenfunctions of \((T_{\mu ,k} f)(x)=\int _{\mathbb {R}^d} k(x,x')f(x')d\mu (x')\) and \(\lambda _p\) is the nonnegative sequence of corresponding eigenvalues sorted in decreasing order. Then, for a non-degenerate kernel—i.e. when \(\lambda _p > 0, \forall p > 0\)—we have the following convergence in probability for the MSE (3) of the BLUP:
$$\begin{aligned} \sigma ^2(x) \mathop {\longrightarrow }\limits ^{n \rightarrow \infty } \sum _{p \ge 0} \frac{ \tau \lambda _p}{\tau + \lambda _p} \phi _p(x)^2. \end{aligned}$$
(7)
For degenerate kernels—i.e. when only a finite number of \(\lambda _p\) are not zero—the convergence is almost sure. We note that these convergences hold with respect to the distribution of the points \((x_i)_{i=1,\ldots ,n}\) of the experimental design set.

The proof of Theorem 1 is given in Sect. 4.

Remark

For non-degenerate kernels such that \(|| \phi _p(x) ||_{L^{\infty }} < \infty \) uniformly in \(p\), the convergence is almost sure. Some kernels such as the one of the Brownian motion satisfy this property.
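The limit (7) can be checked numerically for the Brownian motion on \([0,1]\) with uniform \(\mu \), whose Mercer decomposition is explicit: \(\lambda _p = ((p+1/2)\pi )^{-2}\) and \(\phi _p(x) = \sqrt{2} \sin ((p+1/2)\pi x)\). The sketch below (sample size, \(\tau \) and the test point are illustrative choices) compares the empirical MSE (3) at one point with the right-hand side of (7):

```python
import numpy as np

# Empirical check of the limit (7) for the Brownian motion on [0, 1] with mu
# uniform, whose Mercer decomposition is explicit:
#   lambda_p = ((p + 1/2) * pi)^(-2),  phi_p(x) = sqrt(2) * sin((p + 1/2) * pi * x)
rng = np.random.default_rng(2)
n, tau, x = 400, 0.1, 0.5

X = rng.uniform(0, 1, n)
K = np.minimum.outer(X, X)                       # k(x, x') = min(x, x')
kx = np.minimum(x, X)
sigma2 = x - kx @ np.linalg.solve(K + n * tau * np.eye(n), kx)  # MSE (Eq. 3)

p = np.arange(2000)
lam = ((p + 0.5) * np.pi) ** -2.0
phi2 = 2.0 * np.sin((p + 0.5) * np.pi * x) ** 2
limit = np.sum(tau * lam / (tau + lam) * phi2)   # right-hand side of Eq. (7)
# sigma2 and limit should be close for large n
```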

The following theorem gives the asymptotic value of the learning curve when \(n\) is large.

Theorem 2

Let us consider \(Z(x)\) a Gaussian process with known mean and covariance kernel \(k(x,x') \in \mathcal {C}^0(\mathbb {R}^d \times \mathbb {R}^d )\) such that \(\sup _{x \in \mathbb {R}^d} k(x,x) < \infty \) and \((x_i)_{i=1,\ldots ,n}\) an experimental design set of \(n\) independent random points sampled with the probability measure \(\mu \) on \(\mathbb {R}^d\). Then, for a non-degenerate kernel, we have the following convergence in probability:
$$\begin{aligned} {\text {IMSE}} \mathop {\longrightarrow }\limits ^{n \rightarrow \infty } \sum _{p \ge 0} \frac{ \tau \lambda _p}{\tau + \lambda _p}. \end{aligned}$$
(8)
For degenerate kernels, the convergence is almost sure.

Proof

From Theorem 1 and the orthonormal property of the basis \((\phi _p(x))_p\) in \(L^2_\mu (\mathbb {R}^d)\), the proof of the theorem is straightforward by integration. We note that we can permute the integral and the limit thanks to the dominated convergence theorem since \(\sigma ^2(x) \le k(x,x)\). \(\square \)

A strength of Theorem 2 is that it allows for obtaining the rate of convergence of the learning curve even when the eigenvalues \((\lambda _p)_{p \ge 0}\) are not explicit. Indeed, as presented in Sect. 5.2, this rate can be deduced from the asymptotic behavior of \(\lambda _p\) for large \(p\). Furthermore, this asymptotic behavior is known for usual kernels (fractional Brownian kernel, Matérn covariance kernel, Gaussian covariance kernel, ...). However, this is also a limitation since it could be unknown for general covariance kernels.
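As an illustration of this point, for an eigenvalue decay \(\lambda _p = p^{-2}\) (the decay rate of Brownian-type kernels, taken here as an assumption) the limit in Eq. (8) behaves like \((\pi /2)\sqrt{\tau }\) for small \(\tau \), which can be checked numerically:

```python
import numpy as np

# For an eigenvalue decay lambda_p = p^(-2), the limit in Eq. (8) is
#   sum_p tau * lambda_p / (tau + lambda_p) = sum_p tau / (tau * p^2 + 1),
# which behaves like (pi / 2) * sqrt(tau) as tau -> 0: the generalization
# error then scales like sqrt(tau).
def limit_imse(tau, P=10**6):
    p = np.arange(1, P + 1, dtype=float)
    lam = p ** -2.0
    return np.sum(tau * lam / (tau + lam))   # truncated sum of Eq. (8)

s1, s2 = limit_imse(1e-4), limit_imse(1e-6)
# dividing tau by 100 divides the limit by about sqrt(100) = 10
```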

3.1 Discussion

The limit obtained is identical to the one presented in (Rasmussen and Williams 2006), Sect. 7.3, Eq. (7.26) for a degenerate kernel. Furthermore, the limit in Eq. (8) corresponds to the average bound given for degenerate kernels in (Opper and Vivarelli 1999), Sect. 6, Eq. (17), with the correspondence \(\tau = \sigma ^2/n\). In particular, they prove that it is a lower bound for the generalization error and an upper bound for the training error. The training error is defined as the empirical mean \(\sum _{i=1}^n \sigma ^2(x_i)/n\) where \((x_i)_{i=1,\ldots ,n }\) are the design points. They also note that this bound should be exact in the asymptotic regime of large \(n\), since the sum \(\sum _{i=1}^n \sigma ^2(x_i)/n\) approaches the IMSE asymptotically. Moreover, they numerically observed that this bound is accurate for a Gaussian covariance kernel [(Opper and Vivarelli 1999), Eq. (18)], which is a non-degenerate kernel. The work of Opper and Vivarelli is also investigated in (Williams and Vivarelli 2000; Sollich and Halees 2002; Picheny 2009). In particular, a proof of Theorem 1 is given for degenerate kernels and the relevance of the bound is illustrated on numerical examples using non-degenerate kernels [e.g. the Gaussian covariance kernel and the exponential kernel (Rasmussen and Williams 2006)].

We note that the proof of Theorem 1 for non-degenerate kernels is of interest since the usual kernels for Gaussian process regression are non-degenerate and we will exhibit dramatic differences between the learning curves of degenerate and non-degenerate kernels.

4 Proof of Theorem 1

We present in this section the proof of Theorem 1. The aim is to find the asymptotic value of the MSE \(\sigma ^2(x)\) (3) when \(n\) tends to infinity. The principle of the proof is to find an upper bound and a lower bound for \(\sigma ^2(x)\) which converge to the same quantity. One of the main ideas of the proof is to use the fact that in a Gaussian process regression framework we consider the BLUP, i.e. the one which minimizes the MSE. Therefore, for a given Gaussian process modeling the function \(f(x)\), any LUP has a larger MSE. Furthermore, to provide a lower bound for \(\sigma ^2(x)\), we use the result presented in Theorem 1 for degenerate kernels. Therefore, we start the proof by presenting the degenerate case.

4.1 The degenerate case

The proof in the degenerate case follows the lines of those given in (Opper and Vivarelli 1999; Rasmussen and Williams 2006; Picheny 2009). For a degenerate kernel, the number \(\bar{p}\) of non-zero eigenvalues is finite. Let us denote \( \varLambda = \text {diag}(\lambda _i)_{1 \le i \le \bar{p}}\), \(\phi (x) = (\phi _1(x), \ldots , \phi _{\bar{p}}(x))\) and \( \varPhi =(\phi (x_1)^T, \ldots , \phi (x_{n})^T)^T\). The MSE of the Gaussian process regression (3) is given by:
$$\begin{aligned} \sigma ^2_{\bar{p}}(x) = \phi (x) \varLambda \phi (x)^T - \phi (x) \varLambda \varPhi ^T \left( \varPhi \varLambda \varPhi ^T + n\tau I \right) ^{-1} \varPhi \varLambda \phi (x)^T. \end{aligned}$$
Thanks to the Woodbury–Sherman–Morrison formula and according to (Opper and Vivarelli 1999; Picheny 2009), the Gaussian process regression error can be written:
$$\begin{aligned} \sigma ^2_{\bar{p}}(x) = \phi (x) \left( \frac{\varPhi ^T \varPhi }{n\tau } + \varLambda ^{-1} \right) ^{-1}\phi (x)^T. \end{aligned}$$
Since \(\bar{p}\) is finite, by the strong law of large numbers, the \(\bar{p} \times \bar{p}\) matrix \( \varPhi ^T \varPhi / n\) converges almost surely to the identity matrix as \(n \rightarrow \infty \), by orthonormality of the \((\phi _p)_p\) in \(L^2_\mu \). Therefore, we have the almost sure convergence:
$$\begin{aligned} \sigma ^2_{\bar{p}}(x) \mathop {\longrightarrow }\limits ^{n \rightarrow \infty } \sum _{p \le \bar{p}} \frac{ \tau \lambda _p}{\tau + \lambda _p} \phi _p(x)^2. \end{aligned}$$
(9)
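The Woodbury–Sherman–Morrison rewriting used above can be verified numerically on a small degenerate example (dimensions, eigenvalues and random features below are arbitrary illustrative choices):

```python
import numpy as np

# Check of the Woodbury-Sherman-Morrison rewriting of sigma^2_pbar(x):
# the two expressions of the degenerate-case MSE coincide.
rng = np.random.default_rng(3)
n, pbar, tau = 40, 3, 0.1
Lam = np.diag([1.0, 0.5, 0.2])                   # the pbar non-zero eigenvalues
Phi = rng.standard_normal((n, pbar))             # rows play the role of phi(x_i)
phi = rng.standard_normal(pbar)                  # phi(x) at a test point

K = Phi @ Lam @ Phi.T                            # degenerate covariance matrix
direct = phi @ Lam @ phi - phi @ Lam @ Phi.T @ np.linalg.solve(
    K + n * tau * np.eye(n), Phi @ Lam @ phi)    # first expression of the MSE
woodbury = phi @ np.linalg.solve(
    Phi.T @ Phi / (n * tau) + np.linalg.inv(Lam), phi)  # rewritten expression
```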

4.2 The lower bound for \(\sigma ^2(x)\)

The objective is to find a lower bound for the MSE \(\sigma ^2(x)\) (3) for non-degenerate kernels.

If we denote by \(a_i(x)\) the coefficients of the BLUP \(\hat{f}(x)\) associated to \(Z(x)\), i.e. \(\hat{f}(x) = \sum _{i=1}^{n} a_i(x) (Z(x_i) + \varepsilon _i)\), then the MSE can be written:
$$\begin{aligned} \sigma ^2(x)&= \mathbb {E}\left[ \left( Z(x) - \sum _{i=1}^{n} a_i(x) (Z(x_i)+ \varepsilon _i) \right) ^2\right] . \end{aligned}$$
Let us consider the Karhunen-Loève decomposition of \( Z(x) = \sum _{p \ge 0}Z_p \sqrt{\lambda _p} \phi _p(x) \) where \((Z_p)_p\) is a sequence of independent Gaussian random variables with mean zero and variance 1 and \(\lambda _p > 0\) for all \(p \in \mathbb {N}^*\). Therefore, we have the equalities \(\mathbb {E}\left[ Z_p\right] = 0, \mathbb {E}\left[ Z_p^2\right] = 1\) and \(\mathbb {E}\left[ Z_pZ_q \right] = 0\) when \(p \ne q\). Then, the MSE equals:
$$\begin{aligned} \sigma ^2(x)&= \mathbb {E}\left[ \left( \sum _{p \ge 0} \sqrt{\lambda _p} \left( \phi _p(x) - \sum _{i=1}^{n} a_i(x) \phi _p(x_i) \right) Z_p \right) ^2\right] + n\tau \sum _{i=1}^n a_i(x)^2\\&= \sum _{p \ge 0} \lambda _p \left( \phi _p(x) - \sum _{i=1}^{n} a_i(x) \phi _p(x_i) \right) ^2 + n\tau \sum _{i=1}^n a_i(x)^2. \end{aligned}$$
Then, for a fixed \(\bar{p}\), the following inequality holds:
$$\begin{aligned} \sigma ^2(x) \ge \sum _{p \le \bar{p}} \lambda _p \left( \phi _p(x) - \sum _{i=1}^{n} a_i(x) \phi _p(x_i) \right) ^2 + n\tau \sum _{i=1}^n a_i(x)^2 = \sigma ^2_{LUP,\bar{p}}(x). \end{aligned}$$
(10)
\(\sigma ^2_{LUP,\bar{p}}(x)\) is the MSE of the LUP of coefficients \(a_i(x)\) associated to the Gaussian process \(Z_{\bar{p}}(x) = \sum _{p \le \bar{p}}Z_p \sqrt{\lambda _p} \phi _p(x)\) and noisy observations with variance \(n\tau \). Let us consider \(\sigma ^2_{\bar{p}}(x)\) the MSE of the BLUP of \(Z_{\bar{p}}(x)\), we have the following inequality:
$$\begin{aligned} \sigma ^2_{LUP,\bar{p}}(x) \ge \sigma ^2_{\bar{p}}(x). \end{aligned}$$
(11)
Since \(Z_{\bar{p}}(x)\) has a degenerate kernel, the almost sure convergence given in Eq. (9) holds for \(\sigma ^2_{\bar{p}}(x)\). Then, considering inequalities (10) and (11) and the convergence (9), we obtain: \( \liminf _{n \rightarrow \infty } \sigma ^2(x) \ge \sum _{p \le \bar{p}}{\tau \lambda _p}/{ \left( \tau + \lambda _p \right) } \phi _p(x)^2 \). Taking the limit \(\bar{p} \rightarrow \infty \) gives the following lower bound:
$$\begin{aligned} \liminf _{n \rightarrow \infty } \sigma ^2(x) \ge \sum _{p \ge 0} \frac{\tau \lambda _p}{\tau + \lambda _p} \phi _p(x)^2. \end{aligned}$$
(12)

4.3 The upper bound for \(\sigma ^2(x)\)

The objective is to find an upper bound for \(\sigma ^2(x)\). Since \(\sigma ^2(x)\) is the MSE of the BLUP associated to \(Z(x)\), if we consider any other LUP associated to \(Z(x)\), then the corresponding MSE denoted by \(\sigma _{LUP}^2(x)\) satisfies the following inequality:
$$\begin{aligned} \sigma ^2(x) \le \sigma _{LUP}^2(x). \end{aligned}$$
The idea is to find a LUP so that its MSE is a sharp upper bound of \(\sigma ^2(x)\). We consider the following LUP:
$$\begin{aligned} \hat{f}_{LUP}(x) = k(x)^T A z^{n}, \end{aligned}$$
(13)
with \(A\) the \(n \times n\) matrix defined by \( A = L^{-1}+\sum _{k=1}^q(-1)^k(L^{-1}M)^kL^{-1} \), where \(L = n \tau I + \sum _{p \le p^*} \lambda _p [\phi _p(x_i) \phi _p(x_j)]_{1 \le i,j \le n}\), \(M = \sum _{p > p^*} \lambda _p [\phi _p(x_i) \phi _p(x_j)]_{1 \le i,j \le n}\), \(q\) is a finite integer and \(p^*\) is such that \(\lambda _{p^*} < \tau \).

The choice of the LUP (13) is motivated by the fact that the matrix \(A\) is an approximation of the inverse of the matrix \((n \tau I + K)\) that is tractable in the calculations. Indeed, we have \((n \tau I + K) = L+M\) and thus \((n \tau I + K)^{-1} =(I+ L^{-1}M)^{-1}L^{-1}\). Then, the term \((I+ L^{-1}M)^{-1}\) is approximated by the truncated series \(\sum _{k=0}^q(-1)^k(L^{-1}M)^k\). We note that the condition \(\lambda _{p^*} < \tau \) is used to control the convergence of this sum when \(q\) tends to infinity.
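The matrix \(A\) is thus a truncated Neumann series, whose error decays geometrically in \(q\) when the spectral radius of \(L^{-1}M\) is below 1 (which the condition \(\lambda _{p^*} < \tau \) ensures asymptotically). The sketch below illustrates this with generic well-conditioned matrices, not the kernel matrices of the paper:

```python
import numpy as np

# Truncated Neumann series:  A(q) approximates (L + M)^{-1}, with error
# decaying geometrically in q when the spectral radius of L^{-1} M is < 1.
rng = np.random.default_rng(4)
n = 15
B = rng.standard_normal((n, n))
L = B @ B.T + 5.0 * np.eye(n)        # well-conditioned SPD part (eigenvalues >= 5)
M = 0.5 * np.eye(n)                  # perturbation, so ||L^{-1} M|| <= 0.1

Linv = np.linalg.inv(L)

def A(q):
    """A = L^{-1} + sum_{k=1}^{q} (-1)^k (L^{-1} M)^k L^{-1}."""
    return sum((-1) ** k * np.linalg.matrix_power(Linv @ M, k) @ Linv
               for k in range(q + 1))

exact = np.linalg.inv(L + M)
errs = [np.linalg.norm(exact - A(q)) for q in (0, 2, 4)]
# the error shrinks roughly by ||L^{-1} M||^2 per two extra terms
```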

The MSE of the LUP (13) is given by:
$$\begin{aligned} \sigma ^2_{LUP}(x) = k(x,x) - k(x)^T\left( 2A - A(n \tau I + K)A\right) k(x), \end{aligned}$$
and by substituting the expression of \(A\) into the previous equation we obtain:
$$\begin{aligned} \sigma ^2_{LUP}(x) = k(x,x) - k(x)^T L^{-1} k(x) - \sum _{i=1}^{2q+1} (-1)^ik(x)^T(L^{-1}M)^iL^{-1}k(x). \end{aligned}$$
(14)
The rest of the proof consists in finding the asymptotic values of the terms present in the expression of \(\sigma ^2_{LUP}(x)\).

First, we deal with the term \( k(x)^T L^{-1} k(x) \) with the following lemma proved in Appendix.

Lemma 1

Let us consider the term \( k(x)^T L^{-1} k(x) \) in Eq. (14). The following convergence holds:
$$\begin{aligned} k(x)^T L^{-1}k(x) \mathop {\longrightarrow }\limits ^{n \rightarrow \infty } \sum _{p \le p^*} \frac{ \lambda _p^2}{ \lambda _p + \tau } \phi _p(x)^2 + \frac{1}{\tau } \sum _{p > p^*} \lambda _p^2 \phi _p(x)^2. \end{aligned}$$
(15)
Second, let us consider the term \(\sum _{i=1}^{2q+1} (-1)^ik(x)^T(L^{-1}M)^iL^{-1}k(x)\). We have the following equality:
$$\begin{aligned} k(x)^T(L^{-1}M)^iL^{-1}k(x) = \sum _{j=0}^i \left( \begin{array}{c} i \\ j \end{array} \right) \left[ \frac{1}{n \tau } k(x)^T \left( \frac{M}{n \tau } \right) ^j \left( -\frac{L'M}{(n\tau )^2} \right) ^{i-j}k(x) - k(x)^T \left( \frac{M}{n \tau } \right) ^j \left( -\frac{L'M}{(n\tau )^2} \right) ^{i-j}\frac{L'}{(n \tau )^2}k(x) \right] , \end{aligned}$$
(16)
with \( L' = \varPhi _{p^*} \left( \frac{\varPhi _{p^*} ^T \varPhi _{p^*} }{n\tau } + \varLambda ^{-1} \right) ^{-1} \varPhi _{p^*} ^T = \sum _{p,p' \le p^*} d^{(n)}_{p,p'}[ \phi _p(x_i) \phi _{p'}(x_j)]_{1 \le i,j \le n} \) and \( d_{p,p'}^{(n)} = \left[ \left( \frac{\varPhi _{p^*}^T \varPhi _{p^*}}{n\tau } + \varLambda ^{-1} \right) ^{-1} \right] _{p,p'}\). Since \(q < \infty \), we can obtain the convergence in probability of \(\sum _{i=1}^{2q+1} (-1)^ik(x)^T(L^{-1}M)^iL^{-1}k(x)\) from the ones of:
$$\begin{aligned} k(x)^T\frac{1}{n }\left( \frac{M}{n } \right) ^j \left( \frac{L'M}{n^2}\right) ^{i-j} k(x), \end{aligned}$$
(17)
and:
$$\begin{aligned} k(x)^T\left( \frac{M}{n } \right) ^j \left( \frac{L'M}{n^2}\right) ^{i-j}\frac{L'}{n^2} k(x), \end{aligned}$$
(18)
with \(i \le 2q+1\) and \( j \le i\). We first study the convergence of the term (17) for \(j < i\) and the term (18) for \(j \le i\). Then, we study the convergence of (17) for \(j = i\). We have the following lemma proved in Appendix:

Lemma 2

For \(j < i\) we have the following convergence when \(n \rightarrow \infty \):
$$\begin{aligned} k(x)^T\frac{1}{n }\left( \frac{M}{n } \right) ^j \left( \frac{L'M}{n^2}\right) ^{i-j} k(x) \mathop {\longrightarrow }\limits ^{\mathbb {P}_\mu } 0, \end{aligned}$$
(19)
and for \(j \le i\) the following one holds:
$$\begin{aligned} k(x)^T\left( \frac{M}{n } \right) ^j \left( \frac{L'M}{n^2}\right) ^{i-j} \frac{L'}{n^2} k(x) \mathop {\longrightarrow }\limits ^{\mathbb {P}_\mu } 0. \end{aligned}$$
(20)

We note that the convergences presented in Lemma 2 hold in probability. We also have the following lemma, proved in the Appendix:

Lemma 3

The following convergence holds when \(n \rightarrow \infty \):
$$\begin{aligned} \frac{1}{n} k(x)^T\left( \frac{M}{n}\right) ^i k(x) \mathop {\longrightarrow }\limits ^{\mathbb {P}_\mu } \sum _{p > p^*}\lambda _p^{i+2} \phi _p(x)^2. \end{aligned}$$
(21)
From the convergences (19) and (20) and thanks to the equality (16), we deduce the following convergence when \(n \rightarrow \infty \):
$$\begin{aligned} k(x)^T\left( L^{-1}M\right) ^iL^{-1} k(x) - \frac{1}{n \tau ^{i+1}} k(x)^T\left( \frac{M}{n}\right) ^i k(x) \mathop {\longrightarrow }\limits ^{\mathbb {P}_\mu } 0. \end{aligned}$$
Then, using the convergence (21) we obtain when \(n \rightarrow \infty \):
$$\begin{aligned} k(x)^T (L^{-1}M)^iL^{-1}k(x) \mathop {\longrightarrow }\limits ^{\mathbb {P}_\mu } \left( \frac{1}{\tau }\right) ^{i+1} \sum _{p > p^*} \lambda _p^{i+2}\phi _p(x)^2. \end{aligned}$$
(22)
From Eq. (14) and the convergences (15) and (22), together with the classical formula for partial sums of geometric series, we obtain the following convergence when \(n \rightarrow \infty \):
$$\begin{aligned} \sigma ^2_{LUP}(x) \mathop {\longrightarrow }\limits ^{\mathbb {P}_\mu } \sum _{p \ge 0} \left( \lambda _p - \frac{ \lambda _p^2}{\tau + \lambda _p} \right) \phi _p(x)^2 - \sum _{p > p^*} \lambda _p^2 \frac{\left( \frac{ \lambda _p}{\tau } \right) ^{2q+1}}{\tau + \lambda _p} \phi _p(x)^2. \end{aligned}$$
(23)
By considering the limit \(q \rightarrow \infty \) and the inequality \( \lambda _{p^*} < \tau \), we obtain the following upper bound for \(\sigma ^2(x)\):
$$\begin{aligned} \limsup _{n \rightarrow \infty } \sigma ^2(x) \le \sum _{p \ge 0} \frac{\tau \lambda _p}{\tau + \lambda _p} \phi _p(x)^2. \end{aligned}$$
(24)
The result announced in Theorem 1 is deduced from the lower and upper bounds (12) and (24).

5 Examples of rates of convergence for the learning curve

5.1 Numerical study on the assumptions of Theorem 2

Theorem 2 gives the asymptotic value of the IMSE (5) when the number of observations \(n\) increases. The aim of this section is to determine when the assumptions of Theorem 2 hold—i.e. to find the critical number of observations \(n\) beyond which, for a given \(\tau \) and a given covariance kernel \(k\), the sum in (8) is a sharp approximation of the IMSE. To perform such a study, we consider a Brownian kernel \(k(x,y) = x + y - | x - y |\) with \(x,y \in [0,1], \tau \in \{ 0.001, 0.01, 0.1\}\) and a uniform measure \(\mu \) on \([0,1]\). The eigenvalues of \(k\) are the following ones (Bronski 2003):
$$\begin{aligned} \lambda _p = \frac{1}{(p + 1/2)^2 \pi ^2}, \,\, p \in \mathbb {N}. \end{aligned}$$
Therefore, for a given \(\tau \), we can explicitly obtain the value of the sum presented in (8) and compare it with an empirical estimation of the IMSE. This empirical estimation is obtained by considering the MSE (3) built from \(n\) points randomly spread into the interval \([0,1]\) and by estimating the integral (5) with a numerical integration. Furthermore, for each pair \((\tau ,n)\) we repeat this procedure 100 times in order to obtain an empirical estimator and confidence intervals for the value of the IMSE. The results of this procedure are presented in Fig. 1.
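To make this concrete, the sum in (8) can be evaluated numerically from these eigenvalues. The sketch below assumes that (8) takes the form \(\sum _{p \ge 0} \tau \lambda _p / (\tau + \lambda _p)\), which is consistent with the bound (24) and with eigenfunctions that are orthonormal with respect to the uniform measure \(\mu \):

```python
import numpy as np

def imse_inf(tau, n_terms=2_000_000):
    """Asymptotic IMSE for the Brownian kernel on [0,1].

    Uses the eigenvalues lambda_p = 1/((p+1/2)^2 pi^2); since the
    eigenfunctions are orthonormal under the uniform measure, the integral
    of the limiting MSE reduces to sum_p tau*lambda_p/(tau + lambda_p).
    """
    p = np.arange(n_terms)
    lam = 1.0 / ((p + 0.5) ** 2 * np.pi ** 2)
    return float(np.sum(tau * lam / (tau + lam)))

for tau in (0.1, 0.01, 0.001):
    print(tau, imse_inf(tau))
```

The truncation error of the series is of order \(1/(\pi ^2 \, \texttt{n\_terms})\) and is negligible here; the three printed values match the \(\text {IMSE}_\infty \) values quoted below for \(\tau = 10^{-1}, 10^{-2}, 10^{-3}\).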
Figure 1 shows the ratio between the value of the \({\text {IMSE}}\) for a given \(n\) and the asymptotic value \(\text {IMSE}_\infty \) given by (8); for large \(n\) this ratio is close to one, which illustrates the convergence of the IMSE to its asymptotic value.
Fig. 1

Comparison between the IMSE for different \(n\) and the theoretical asymptotic value \(\text {IMSE}_\infty \) given by the sum (8). The ratio \({\text {IMSE}}/\text {IMSE}_\infty \) is plotted as a function of \(n\) for three values of \(\tau \). For each pair \((\tau ,n)\) 100 approximations of \({\text {IMSE}}\) are evaluated from design points randomly spread on \([0,1]\). From them the empirical mean (represented by the dashed lines) and the 5 % and 95 % confidence intervals (represented by the dotted lines) of the ratio \({\text {IMSE}} / \text {IMSE}_\infty \) are evaluated

We observe in Fig. 1 that the convergence is effective for \(n < 100\) for all values of \(\tau \). The convergence is slower for small values of \(\tau \): the asymptotic value (8) is a good approximation of the IMSE if \(n \ge 5\) for \(\tau = 10^{-1}\), if \(n \ge 20\) for \(\tau = 10^{-2}\) and if \(n \ge 60\) for \(\tau = 10^{-3}\). This corresponds approximately to the threshold values \(n \tau = 0.5\) for \( \text {IMSE}_\infty = 0.1575\), \(n \tau = 0.2\) for \( \text {IMSE}_\infty = 0.05\) and \(n \tau = 0.06\) for \( \text {IMSE}_\infty = 0.0158\); or globally to \(n \tau \approx 4 \, \text {IMSE}_\infty \).

This highlights the relevance of the asymptotic value of the IMSE given in Theorem 2. However, in general we do not have an explicit expression for the eigenvalues of a covariance kernel. In this case, we can obtain the asymptotic expression of \(\text {IMSE}_\infty \) for small \(\tau \) from the asymptotic behavior of the eigenvalues \((\lambda _p)_{p \ge 0}\) for large \(p\). We deal with this issue in the next subsection.

5.2 Rate of convergence for some usual kernels

Theorem 2 gives the asymptotic value of the generalization error as a function of the eigenvalues of the covariance kernel. However, this asymptotic value is hard to handle since the expression of the eigenvalues is rarely known. To deal with this problem, we introduce in Proposition 1 a quantity \(B_\tau \) which has the same rate of convergence as the asymptotic value of the generalization error and which is tractable for our purpose.

Proposition 1

Let us denote \(\text {IMSE}_\infty = \lim _{n \rightarrow \infty }\text {IMSE} \). The following inequality holds:
$$\begin{aligned} \frac{1}{2}B_\tau \le \text {IMSE}_\infty \le B_\tau , \end{aligned}$$
(25)
with \(B_\tau = \sum _{p\;\text {s.t.}\;\lambda _p \le \tau } \lambda _p + \tau \# \left\{ p\;\text {s.t.}\;\lambda _p > \tau \right\} \).

Proof

The proof is directly deduced from Theorem 2 and the following inequality:
$$\begin{aligned} \frac{1}{2} h_\tau (x) \le \frac{x}{x + \tau } \le h_\tau (x), \end{aligned}$$
with:
$$\begin{aligned} h_\tau (x) = \left\{ \begin{array}{ll} x /\tau &{} x \le \tau \\ 1 &{} x > \tau \\ \end{array} \right. . \end{aligned}$$
\(\square \)

Proposition 1 shows that the rate of convergence of the generalization error \(\text {IMSE}_\infty \) as a function of \(\tau \) is the same as that of \(B_\tau \). In the remainder of this section, we analyze the rate of convergence of \(\text {IMSE}_\infty \) (or equivalently of \(B_\tau \)) when \(\tau \) is small.
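The two-sided bound (25) is easy to verify numerically. The following sketch assumes the Brownian eigenvalues of Sect. 5.1 and the sum form \(\text {IMSE}_\infty = \sum _p \tau \lambda _p/(\tau + \lambda _p)\) (unit-norm eigenfunctions); both quantities are truncated at the same index, which preserves the term-by-term inequality:

```python
import numpy as np

def brownian_eigs(n_terms):
    # eigenvalues of the Brownian kernel on [0,1] (Sect. 5.1)
    p = np.arange(n_terms)
    return 1.0 / ((p + 0.5) ** 2 * np.pi ** 2)

def b_tau(tau, lam):
    # B_tau = sum of small eigenvalues + tau * (number of large ones)
    return float(lam[lam <= tau].sum() + tau * np.count_nonzero(lam > tau))

def imse_inf(tau, lam):
    return float(np.sum(tau * lam / (tau + lam)))

lam = brownian_eigs(1_000_000)
for tau in (0.1, 0.01, 0.001):
    b, i = b_tau(tau, lam), imse_inf(tau, lam)
    assert 0.5 * b <= i <= b    # Proposition 1
```

The assertion holds for each eigenvalue individually (via the function \(h_\tau \) of the proof), so it also holds for the truncated sums.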

Throughout this section, the design measure \(\mu \) is uniform on \([0,1]^d\).

Example 2

(Degenerate kernels) For degenerate kernels we have \( \# \left\{ p\;\text {s.t.}\; \lambda _p > 0 \right\} < \infty \). Thus, when \(\tau \rightarrow 0\), we have:
$$\begin{aligned} \sum _{p\;\text { s.t. }\;\lambda _p < \tau } \lambda _p = 0, \end{aligned}$$
from which:
$$\begin{aligned} B_\tau \propto \tau . \end{aligned}$$
(26)

Therefore, the IMSE decreases proportionally to \(\tau \). We recover here a classical result about Monte-Carlo convergence, which states that the variance decay is proportional to the observation noise variance (\(n\tau \)) divided by the number of observations (\(n\)), for any dimension. Nevertheless, for non-degenerate kernels, the number of non-zero eigenvalues is infinite and we are hence in an infinite-dimensional case (contrary to the degenerate one). We see in the following examples that the usual Monte-Carlo convergence rate is not conserved in this case, which emphasizes the importance of Theorem 1 dealing with non-degenerate kernels.

Example 3

(The fractional Brownian motion) Let us consider the fractional Brownian kernel with Hurst parameter \(H \in (0,1)\):
$$\begin{aligned} k(x,y) = x^{2H} + y^{2H} - |x-y|^{2H}. \end{aligned}$$
(27)

The associated Gaussian process—called fractional Brownian motion—is Hölder continuous with exponent \(H-\varepsilon \) for any \(\varepsilon > 0\). According to Bronski (2003), we have the following result:

Lemma 4

The eigenvalues of the fractional Brownian motion with Hurst exponent \(H \in (0,1)\) satisfy the behavior
$$\begin{aligned} \lambda _p = \frac{\nu _H}{p^{2H+1}} + o \left( p^{-\frac{(2H+2)(4H+3)}{4H+5}+\delta } \right) , \quad p\gg 1, \end{aligned}$$
where \(\delta > 0\) is arbitrary, \(\nu _H= \frac{\text {sin}(\pi H)\Gamma (2H+1)}{ \pi ^{2H+1}}\), and \(\Gamma \) is the Euler Gamma function.
Therefore, when \(\tau \ll 1\), we have:
$$\begin{aligned} \lambda _p < \tau \quad \text {if} \quad p > \left( \frac{ \nu _H}{\tau } \right) ^{\frac{1}{2H+1}}. \end{aligned}$$
We hence have the following approximation for \(B_\tau \):
$$\begin{aligned} B_\tau \approx \sum _{p > \left( \frac{ \nu _H}{\tau } \right) ^{\frac{1}{2H+1}}} \frac{\nu _H}{p^{2H+1}} + \tau \left( \frac{ \nu _H}{\tau } \right) ^{\frac{1}{2H+1}}. \end{aligned}$$
Furthermore, we have:
$$\begin{aligned} \sum _{p > \left( \frac{ \nu _H}{\tau } \right) ^{\frac{1}{2H+1}}} \frac{\nu _H}{p^{2H+1}} \approx \int \limits _{\left( \frac{ \nu _H}{\tau } \right) ^{\frac{1}{2H+1}}}^{+\infty } \frac{\nu _H}{x^{2H+1}} \, dx = \frac{\nu _H}{2H\left( \frac{ \nu _H}{\tau } \right) ^{1-\frac{1}{2H+1}}}, \end{aligned}$$
from which:
$$\begin{aligned} B_\tau \approx C_H \tau ^{1-\frac{1}{2H+1}}, \quad \tau \ll 1, \end{aligned}$$
(28)
where \(C_H\) is a constant independent of \(\tau \).

The rate of convergence for a fractional Brownian motion with Hurst parameter \(H\) is \(\tau ^{1-\frac{1}{2H+1}}\). We note that the case \(H = 1/2\) corresponds to the classical Brownian motion. We observe that the larger the Hurst parameter is (i.e. the more regular the Gaussian process is), the faster the convergence is. Furthermore, for \(H \rightarrow 1\) the convergence rate gets close to \(\tau ^{2/3}\). Therefore, even for the most regular fractional Brownian motion, we are still far from the classical Monte-Carlo convergence rate.
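The rate (28) can be checked numerically: evaluating \(B_\tau \) at two values of \(\tau \) and taking their ratio should reproduce the exponent \(1-\frac{1}{2H+1}\). The sketch below keeps only the leading term \(\nu _H/p^{2H+1}\) of Lemma 4, so it illustrates the rate rather than the exact constant:

```python
import numpy as np
from math import gamma, pi, sin

def b_tau_fbm(tau, H, n_terms=3_000_000):
    # leading-order eigenvalues nu_H / p^(2H+1), p >= 1 (Lemma 4)
    nu = sin(pi * H) * gamma(2 * H + 1) / pi ** (2 * H + 1)
    p = np.arange(1, n_terms + 1)
    lam = nu / p ** (2 * H + 1)
    return float(lam[lam <= tau].sum() + tau * np.count_nonzero(lam > tau))

H = 0.5
rate = 1 - 1 / (2 * H + 1)                 # predicted exponent from (28)
ratio = b_tau_fbm(1e-7, H) / b_tau_fbm(1e-6, H)
print(ratio, 10 ** (-rate))                # both close to 0.316
```

Dividing \(\tau \) by ten multiplies \(B_\tau \) by \(10^{-(1-\frac{1}{2H+1})}\), as predicted; for \(H=1/2\) both numbers are close to \(10^{-1/2} \approx 0.316\).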

Example 4

(The 1-D Matérn covariance kernel) In this example we deal with the Matérn kernel with regularity parameter \(\nu > 0\) in dimension 1:
$$\begin{aligned} k_{1D}(x,x';\nu ,l)=\frac{2^{1- \nu }}{\varGamma (\nu )}\left( \frac{\sqrt{2\nu }|x-x'|}{l}\right) ^\nu K_\nu \left( \frac{\sqrt{2\nu } |x-x'|}{l} \right) , \end{aligned}$$
(29)
where \(K_\nu \) is the modified Bessel function (Abramowitz and Stegun 1965). The eigenvalues of this kernel satisfy the following asymptotic behavior (Nazarov and Nikitin 2004):
$$\begin{aligned} \lambda _p \approx \frac{1}{p^{2( \nu +1/2)}}, \quad p \gg 1. \end{aligned}$$
Following the guideline of Example 3, we deduce the following asymptotic behavior for \(B_\tau \):
$$\begin{aligned} B_\tau \approx C_\nu \tau ^{1-\frac{1}{2( \nu +1/2)}}, \quad \tau \ll 1, \end{aligned}$$
(30)
where \(C_\nu \) is a constant independent of \(\tau \).

This result agrees with that of Ritter (2000a), who proved that for 1-dimensional kernels satisfying the Sacks–Ylvisaker conditions of order \(r\) (where \(r\) is an integer), the generalization error for the best linear estimator and experimental design strategy decays as \(\tau ^{1-\frac{1}{2r+2}}\). Indeed, for such kernels, the eigenvalues satisfy the large-\(p\) behavior \(\lambda _p \propto 1/p^{2r+2}\) (Rasmussen and Williams 2006) and, following the guideline of the previous examples, we find the same convergence rate. We note that the Matérn kernel with parameter \(\nu = r + 1/2\) satisfies the Sacks–Ylvisaker conditions of order \(r\).

Example 5

(The d-D tensorial Matérn covariance kernel) We focus here on the \(d\)-dimensional tensorial Matérn kernel with isotropic regularity parameter \(\nu > \frac{1}{2}\). According to Pusev (2011) the eigenvalues of this kernel satisfy the asymptotics:
$$\begin{aligned} \lambda _p \approx \phi (p), \quad p\gg 1, \end{aligned}$$
where the function \(\phi \) is defined by:
$$\begin{aligned} \phi (p) = \frac{\text {log}(1+p)^{2(d-1)( \nu +1/2)}}{p^{2 ( \nu +1/2)}}. \end{aligned}$$
Its inverse \(\phi ^{-1}\) satisfies:
$$\begin{aligned} \phi ^{-1}(\varepsilon ) = \varepsilon ^{-\frac{1}{2( \nu +1/2)}} \left( \text {log}\left( \varepsilon ^{-\frac{1}{2( \nu +1/2)}}\right) \right) ^{d-1}(1+ o (1)), \quad \varepsilon \ll 1. \end{aligned}$$
We hence have the approximation:
$$\begin{aligned} B_\tau \approx \frac{\text {log}\left( 1+\phi ^{-1}\left( \tau \right) \right) ^{2(d-1)( \nu +1/2)}}{\left( 2 ( \nu +1/2)-1\right) {\phi ^{-1}\left( \tau \right) }^{2( \nu +1/2)-1}} + \tau \phi ^{-1}\left( \tau \right) . \end{aligned}$$
We can deduce the following rate of convergence for \(B_\tau \):
$$\begin{aligned} B_\tau \approx C_{\nu ,d}\, \tau ^{1-\frac{1}{2( \nu +1/2)}} \text {log}\left( 1/\tau \right) ^{d-1}, \quad \tau \ll 1, \end{aligned}$$
(31)
with \(C_{\nu ,d}\) a constant independent of \(\tau \).

Example 6

(The d-D Gaussian covariance kernel) According to Todor (2006) the asymptotic behavior of the eigenvalues for a Gaussian kernel is:
$$\begin{aligned} \lambda _p \lesssim \text {exp}\left( -p^{\frac{1}{d}}\right) . \end{aligned}$$
Applying the procedure presented in the previous examples, it can be shown that the rate of convergence of the IMSE is bounded by:
$$\begin{aligned} C_d \tau \text {log}\left( 1/\tau \right) ^d, \quad \tau \ll 1, \end{aligned}$$
(32)
with \(C_d\) a constant independent of \(\tau \).
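As a quick sanity check of (32), one can evaluate \(B_\tau \) for the model eigenvalues \(\lambda _p = \exp (-p^{1/d})\) (here \(d=1\)) and verify that \(B_\tau / (\tau \, \text {log}(1/\tau )^d)\) stays bounded as \(\tau \rightarrow 0\). The eigenvalues are a stand-in for the envelope above, not those of an actual Gaussian kernel:

```python
import numpy as np

def b_tau_gauss(tau, d=1, n_terms=300):
    # model eigenvalue envelope lambda_p = exp(-p^(1/d))
    p = np.arange(n_terms)
    lam = np.exp(-p ** (1.0 / d))
    return float(lam[lam <= tau].sum() + tau * np.count_nonzero(lam > tau))

for tau in (1e-6, 1e-9, 1e-12):
    print(tau, b_tau_gauss(tau) / (tau * np.log(1 / tau) ** 1))
```

The printed ratios remain of order one (they slowly approach 1 from above), consistent with the \(\tau \, \text {log}(1/\tau )^d\) rate.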

Remark

We can see from the previous examples that for smooth kernels the convergence rate is close to \(\tau \) up to logarithmic factors, i.e. close to the classical Monte-Carlo rate.

5.3 Numerical examples

We compare the previous theoretical results on the rate of convergence of the generalization error with full numerical simulations. In order to observe the asymptotic convergence, we fix \(n = 200\) and we consider \(1/\tau \) varying from \(50\) to \(1000\). The experimental design sets are sampled from a uniform measure on \([0,1]\) and the observation noise variance is \(n\tau \). To estimate the IMSE (5) we use a trapezoidal numerical integration with 4000 quadrature points over \([0,1]\). Furthermore, to build the convergence curves in Figs. 2 and 3 we use a linear regression with the first value of the IMSE, an intercept fixed to zero (since the IMSE tends to 0 when \(\tau \) tends to 0) and a unique explanatory variable corresponding to the tested convergence rate (e.g. \(\tau ^{0.1}, \tau \log (1 / \tau )\), ...).
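This numerical pipeline can be sketched for the Brownian covariance \(k(x,y)=\min (x,y)\), whose eigenvalues are exactly the \(\lambda _p\) of Sect. 5.1. As simplifications, a regular grid stands in for a draw from the uniform design measure and the quadrature is a plain average over a uniform grid instead of the trapezoidal rule:

```python
import numpy as np

def imse_brownian(n, tau, n_quad=2001):
    """Numerical IMSE for GP regression with k(x,y) = min(x,y) on [0,1]."""
    x = (np.arange(n) + 0.5) / n              # design points (regular grid)
    K = np.minimum.outer(x, x)                # covariance matrix of the design
    t = np.linspace(0.0, 1.0, n_quad)
    k_t = np.minimum.outer(t, x)              # cross-covariances k(t, x_i)
    A = np.linalg.solve(K + n * tau * np.eye(n), k_t.T)
    mse = t - np.einsum('in,ni->i', k_t, A)   # k(t,t) = t for this kernel
    return float(mse.mean())                  # uniform-measure quadrature

tau = 0.01
p = np.arange(1_000_000)
lam = 1.0 / ((p + 0.5) ** 2 * np.pi ** 2)     # eigenvalues of min(x,y)
imse_asym = float(np.sum(tau * lam / (tau + lam)))
ratio = imse_brownian(200, tau) / imse_asym   # close to 1 for n = 200
```

With \(n = 200\) and \(\tau = 10^{-2}\) the numerical IMSE is close to its asymptotic value, in line with Fig. 1.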

First, we deal with the 1-D fractional Brownian kernel (27) with Hurst parameter \(H\). We have proved that for large \(n\), the IMSE decays as \(\tau ^{1-\frac{1}{2H+1}}\). Figure 2 compares the numerically estimated convergences to the theoretical ones.
Fig. 2

Rate of convergence of the IMSE when the level of observation noise decreases for a fractional Brownian motion with Hurst parameter \(H=0.5\) (left) and \(H=0.9\) (right). The number of observations is \(n=200\) and the observation noise variance is \(n \tau \) with \(1/\tau \) varying from \(50\) to \(1000\). The triangles represent the numerically estimated IMSE, the solid line represents the theoretical convergence, and the other non-solid lines represent various convergence rates

We see in Fig. 2 that the observed rate of convergence is perfectly fitted by the theoretical one. We note that we are far from the classical Monte-Carlo rate since we are in a non-degenerate case.

Finally, we deal with the 2-D tensorial Matérn-\({5}/{2}\) kernel and the 1-D Gaussian kernel. The 1-dimensional Matérn-\(\nu \) class of covariance functions \(k_{1D}(x,x';\nu ,\theta ) \) is given by (29) and the 2-D tensorial Matérn-\(\nu \) covariance function is given by:
$$\begin{aligned} k(x,x';\nu ,\theta ) = k_{1D}(x_1,x_1';\nu ,\theta _1) k_{1D}(x_2,x_2';\nu ,\theta _2). \end{aligned}$$
(33)
Furthermore, the 1-D Gaussian kernel is defined by:
$$\begin{aligned} k(x,x';\theta ) = \text {exp}\left( -\frac{1}{2} \frac{(x-x')^2}{\theta ^2} \right) . \end{aligned}$$
Figure 3 compares the numerically observed convergence of the IMSE to the theoretical one when \(\theta _1 = \theta _2 = 0.2\) for the Matérn-\({5}/{2}\) kernel and when \(\theta = 0.2\) for the Gaussian kernel. We see in Fig. 3 that the theoretical rate of convergence is a sharp approximation of the observed one.
Fig. 3

Rate of convergence of the IMSE when the level of observation noise decreases for a 2-D tensorial Matérn-\({5}/{2}\) kernel on the left hand side and for a 1-D Gaussian kernel on the right hand side. The number of observations is \(n=200\) and the observation noise variance is \(n \tau \) with \(1/\tau \) varying from \(100\) to \(1000\). The triangles represent the numerically estimated IMSE, the solid line represents the theoretical convergence, and the other non-solid lines represent various convergences

6 Applications of the learning curve

Let us consider that we want to approximate the function \(x\in \mathbb {R}^d \rightarrow f(x) \) from noisy observations at fixed points \((x_i)_{i=1,\ldots ,n}\), with \(n \gg 1\), sampled from the design measure \(\mu \) and with \(s\) replications at each point \(x_i\). In Sect. 6.1 we present how to determine the needed budget \(T=ns\) to achieve a prescribed precision. Then, in Sect. 6.2, we illustrate this method on an industrial example.

6.1 Estimation of the budget required to reach a prescribed precision

Let us consider a prescribed generalization error denoted by \({\bar{\varepsilon }}\). The purpose of this subsection is to determine from an initial budget \(T_0\) the budget \(T\) for which the generalization error reaches the value \({\bar{\varepsilon }}\).

First, we build an initial experimental design set \((x_i^\text {train})_{i=1,\ldots ,n}\) sampled with respect to the design measure \(\mu \) and with \(s^*\) replications at each point such that \(T_0=ns^*\). From the \(s^*\) replications \((z_{i,j})_{j=1,\ldots ,s^*}\), we can estimate the observation noise variance \(\sigma _\varepsilon ^2\) with a classical empirical estimator: \( \bar{\sigma }^2_\varepsilon = \sum _{i=1}^n \sum _{j=1}^{s^*}(z_{i,j}-z_i^n)^2/(n(s^*-1)),\, z_i^n = \sum _{j=1}^{s^*}z_{i,j}/s^*\).
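This estimator can be sketched on synthetic data; the latent responses and the noise level below are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n, s_star = 50, 200                    # design points, replications
true_var = 0.01                        # hypothetical noise variance
f = rng.uniform(size=n)                # hypothetical latent responses
z = f[:, None] + rng.normal(scale=np.sqrt(true_var), size=(n, s_star))

z_bar = z.mean(axis=1)                 # z_i^n: mean over the replications
var_hat = float(((z - z_bar[:, None]) ** 2).sum() / (n * (s_star - 1)))
print(var_hat)                         # unbiased estimate of true_var
```

With \(n(s^*-1)\) degrees of freedom the relative error of the estimate is of order \(\sqrt{2/(n(s^*-1))}\), about 1.4 % here.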

Second, we use the observations \(z_i^n = (\sum _{j=1}^{s^*}z_{i,j})/s^*\) to estimate the covariance kernel \(k(x,x')\). In practice, we consider a parametrized family of covariance kernels and we select the parameters which maximize the likelihood (Stein 1999).

Third, from Theorem 2 we can get the expression of the generalization error decay with respect to \(T\) (denoted by \(\hbox {IMSE}_T\)). Therefore, we just have to determine the budget \(T\) such that \(\text {IMSE}_T = \bar{\varepsilon }\). In practice, we will not use Theorem 2 but the asymptotic results described in Sect. 5.2.

This strategy is applied to an industrial case in Sect. 6.2. We note that in the application presented in Sect. 6.2, we have \(s^*=1\). In fact, in this example the observations are themselves obtained by an empirical mean of a Monte-Carlo sample and thus the noise variance can be estimated without processing replications.
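The three steps above reduce to a small root-finding problem once a decay law is chosen. The sketch below assumes a law of the form used later in Sect. 6.2.3 (a logarithmic factor times a power of \(T/\bar{\sigma }^2_\varepsilon \)); the function names are ours, and the returned budget is only as good as the fitted regularity parameter \(\nu \) and the assumed decay law:

```python
import math

def imse_model(T, nu, sigma2_eps, T0, imse0):
    # IMSE_T = IMSE_T0 * g(T) / g(T0), with
    # g(T) = log(T/sigma2_eps) / (T/sigma2_eps)^(1 - 1/(2*(nu+1/2)))
    a = 1.0 - 1.0 / (2.0 * (nu + 0.5))
    g = lambda t: math.log(t / sigma2_eps) / (t / sigma2_eps) ** a
    return imse0 * g(T) / g(T0)

def required_budget(target, nu, sigma2_eps, T0, imse0):
    lo = hi = float(T0)
    while imse_model(hi, nu, sigma2_eps, T0, imse0) > target:
        hi *= 2.0                     # bracket the root
    for _ in range(200):              # bisection on the decreasing decay law
        mid = 0.5 * (lo + hi)
        if imse_model(mid, nu, sigma2_eps, T0, imse0) > target:
            lo = mid
        else:
            hi = mid
    return hi

# usage with illustrative numbers in the spirit of Sect. 6.2
T = required_budget(2e-4, 1.31, 3.3e-3, 100.0, 1e-3)
```

The bisection is justified because \(g\) is decreasing for \(T/\bar{\sigma }^2_\varepsilon \) large enough, which holds in the bracketed range here.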

6.2 Industrial case: MORET code

We illustrate in this section an industrial application of our results about the rate of convergence of the IMSE.

6.2.1 Data presentation

We use in this section the notation presented in Sect. 2.2. The outputs of the MORET code at point \(x_i\) are denoted by \(Y_j(x_i)\) where \(j=1,\ldots ,s_i\) and \(i=1,\ldots ,n\).

A large database \((Y_j(x_i))_{ i=1,\ldots ,5625, j=1,\ldots ,200}\) is available to us. We divide it into a training set and a test set. The 5625 points \(x_i\) of the database come from a \(75 \times 75\) grid over \([0,1]^2\). The training set consists of \(n= 100\) points \((x_i^{\text {train}})_{i=1,\ldots ,n}\) extracted from the complete database using a Latin Hypercube Sample (Fang et al. 2006) optimized with respect to the maximin criterion, and of the first observations \((Y_1(x_i^{\text {train}}))_{i=1,\ldots ,100}\). We note that the maximin criterion aims to maximize the minimal distance (with respect to the \(L_2\)-norm) between the points of the design. We will use the other 5525 points as a test set.

The aim of the study is—given the training set—to predict the budget needed to achieve a prescribed precision for the surrogate model.

Furthermore, the observation noise variance \(\sigma ^2_{\varepsilon }\) is estimated by \( \bar{\sigma }^2_\varepsilon = 3.3\times 10^{-3}\) (see Sect. 6.1).

6.2.2 Model selection

To build the model, we consider the training set plotted in Fig. 4. It is composed of the \(n=100\) points \((x_i^{\text {train}})_{i=1,\ldots ,n}\) which are uniformly spread on \(Q = [0,1]^2\).
Fig. 4

Initial experimental design set with \(n=100\)

Let us suppose that the response is a realization of a Gaussian process with a tensorial Matérn-\(\nu \) covariance function. The 2-D tensorial Matérn-\(\nu \) covariance function \(k(x,x';\nu ,\theta )\) is given in (33). The hyper-parameters are estimated by maximizing the concentrated Maximum Likelihood (Stein 1999):
$$\begin{aligned} -\frac{1}{2}(z-m)^T(\sigma ^2K+\bar{\sigma }^2_{\varepsilon }I)^{-1}(z-m)-\frac{1}{2} \text {log}\, \text {det}(\sigma ^2K+\bar{\sigma }^2_\varepsilon I), \end{aligned}$$
where \(K = [k(x_i^\text {train},x_j^\text {train};\nu ,\theta ) ]_{i,j=1,\ldots ,n}, I\) is the identity matrix, \(\sigma ^2\) the variance parameter, \(m\) the mean of \(k_{\text {eff},s}(x) \) and \(z = (Y_1(x_1^\text {train}), \ldots , Y_1(x_n^\text {train}))\) the observations at points in the training set. The mean of \(k_{\text {eff},s}(x) \) is estimated by \(m=\frac{1}{100} \sum _{i=1}^{100} Y_{1}(x_i^{\text {train}}) = 0.65\).

Because the convergence rate depends strongly on the regularity parameter \(\nu \), we have to estimate this hyper-parameter accurately in order to evaluate the model error decay. Note that no closed form expression is available for the estimator of \(\sigma ^2\); it hence has to be estimated jointly with \(\theta \) and \(\nu \).

Let us consider the vector of parameters \(\phi = (\nu , \theta _1, \theta _2, \sigma ^2)\). In order to perform the maximization, we have first randomly generated a set of 10,000 parameter vectors \((\phi _k)_{k=1,\ldots ,10^4}\) on the domain \([0.5,3] \times [0.01,2] \times [0.01,2]\times [0.01,1]\). We have then selected the 150 best parameter vectors (i.e. the ones maximizing the concentrated likelihood) and started a quasi-Newton based maximization from each of them. More specifically, we have used the BFGS method (Shanno 1970). Finally, from the results of the 150 maximization procedures, we have selected the best parameter vector. We note that the quasi-Newton maximizations all converged to one of two parameter values: around 30 % to the actual maximum and 70 % to another local maximum.
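The two-stage search just described (coarse random screening, then quasi-Newton restarts from the best candidates) can be sketched as follows. Everything here is a scaled-down stand-in: a 1-D Matérn-3/2 kernel instead of the 2-D tensorial Matérn-\(\nu \), synthetic data, and 500 candidates with 10 L-BFGS-B restarts instead of 10,000 and 150:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.uniform(size=40)
noise = 3.3e-3                                   # fixed noise variance

def matern32(x1, x2, theta):
    # closed-form Matern-3/2 correlation (stand-in for the Matern-nu family)
    d = np.abs(x1[:, None] - x2[None, :]) / theta
    return (1.0 + np.sqrt(3.0) * d) * np.exp(-np.sqrt(3.0) * d)

# synthetic observations from a hypothetical "true" process
K_true = 0.3 * matern32(x, x, 0.4) + noise * np.eye(len(x))
z = rng.multivariate_normal(np.zeros(len(x)), K_true)

def nll(p):
    # negative log marginal likelihood (up to a constant) in (theta, sigma^2)
    theta, s2 = p
    K = s2 * matern32(x, x, theta) + noise * np.eye(len(x))
    L = np.linalg.cholesky(K)
    a = np.linalg.solve(L, z)
    return float(a @ a / 2.0 + np.log(np.diag(L)).sum())

# stage 1: coarse random search; stage 2: quasi-Newton restarts from the best
cands = rng.uniform([0.01, 0.01], [2.0, 1.0], size=(500, 2))
best = sorted(cands, key=nll)[:10]
fits = [minimize(nll, p0, method='L-BFGS-B',
                 bounds=[(0.01, 2.0), (0.01, 1.0)]) for p0 in best]
theta_hat, s2_hat = min(fits, key=lambda r: r.fun).x
```

The random screening guards against the local maxima mentioned above, while the gradient-based restarts refine the best candidates.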

The estimates of the hyper-parameters are \(\nu = 1.31\), \(\theta _1 = 0.67\), \(\theta _2 = 0.45\) and \(\sigma ^2= 0.24 \). This means that we have a rough surrogate model, which is not differentiable and is \(\alpha \)-Hölder continuous with exponent \(\alpha = 0.81\). The variance of the observations is \(\bar{\sigma }^2_\varepsilon =3.3\times 10^{-3}\); using the same notation as in Sect. 2.3, we have \(\tau = \bar{\sigma }^2_\varepsilon /T_0 \) with \(T_0 = n\) (it corresponds to \(s=1\)).

The IMSE of the Gaussian process regression is \(\text {IMSE}_{T_0}=1.0\times 10^{-3}\) and its empirical mean squared error is \(\text {EMSE}_{T_0} = 1.2\times 10^{-3}\). To compute the empirical mean squared error (EMSE), we use the observations \((Y_j(x_i))_{i=1,\ldots ,5525,\, j=1 \ldots , 200}\) with \(x_i \ne x_k^{\text {train}}\) \(\forall k=1,\ldots ,100, i=1,\ldots , 5525\), and to compute the IMSE (5) (which depends only on the positions of the training set and on the selected hyper-parameters) we use a trapezoidal numerical integration on a \(75 \times 75\) grid over \([0,1]^2\). For \(s=200\), the observation variance of the output \(k_{\text {eff},s}(x)\) equals \({ \bar{\sigma }^2_\varepsilon }/{200}=1.64\times 10^{-5}\) and is neglected for the estimation of the empirical error. The IMSE is close to the empirical MSE, which means that our model describes the observations accurately.

6.2.3 Convergence of the IMSE

According to (31), we have the following convergence rate for the IMSE:
$$\begin{aligned} \text {IMSE} \sim \text {log}(1/\tau ) \tau ^{1-\frac{1}{2(\nu +1/2)}} = \frac{\text {log}(T/\bar{\sigma }^2_{\varepsilon })}{(T/\bar{\sigma }^2_{\varepsilon })^{1-\frac{1}{2(\nu +1/2)}}}, \end{aligned}$$
(34)
where the model parameter \(\nu \) plays a crucial role. We can therefore expect that the IMSE decays as (see Sect. 6.1):
$$\begin{aligned} \text {IMSE}_T = \text {IMSE}_{T_0} \frac{\text {log}(T/\bar{\sigma }^2_{\varepsilon })}{(T/\bar{\sigma }^2_{\varepsilon })^{1-\frac{1}{2(\nu +1/2)}}}/ \frac{\text {log}(T_0/\bar{\sigma }^2_{\varepsilon })}{(T_0/\bar{\sigma }^2_{\varepsilon })^{1-\frac{1}{2(\nu +1/2)}}}. \end{aligned}$$
(35)
Let us assume that we want to reach an IMSE of \(\bar{\varepsilon } = 2.0\times 10^{-4}\). According to the IMSE decay and the fact that the IMSE for the budget \(T_0\) has been estimated to be equal to \(1.0\times 10^{-3}\), the total budget required is \(T = ns = 2000\), i.e. \(s = 20\). Figure 5 compares the empirical mean squared error convergence and the predicted convergence (35) of the IMSE.
Fig. 5

Comparison between the empirical mean squared error (EMSE) decay and the theoretical IMSE decay for \(n=100\) when the total budget \(T=ns\) increases. The triangles represent the EMSE, the solid line represents the theoretical decay, the horizontal dashed line represents the desired accuracy and the other dashed line the classical Monte-Carlo convergence. The Monte-Carlo decay is too fast and does not match the empirical MSE

We see empirically that the EMSE of \(\bar{\varepsilon } = 2.0\times 10^{-4}\) is achieved for \(s = 31\). This shows that the predicted IMSE and the empirical MSE are close and that the selected kernel captures the regularity of the response accurately.

Let us consider the classical Monte-Carlo convergence rate \( \bar{\sigma }_\varepsilon ^2/T\), which corresponds to the convergence rate of degenerate kernels, i.e. to the finite-dimensional case. Figure 5 compares the theoretical rate of convergence of the IMSE with the classical Monte-Carlo one. We see that the Monte-Carlo decay is too fast and does not represent correctly the empirical MSE decay. If we had considered the rate of convergence \(\text {IMSE} \sim \bar{\sigma }_\varepsilon ^2/T\), we would have predicted that an IMSE of \(\bar{\varepsilon } = 2.0\times 10^{-4}\) is reached for \(s = 6\) (which is very far from the observed value \( s = 31\)).

7 Conclusion

The main result of this paper is the proof of a theorem giving the Gaussian process regression MSE when the number of observations is large and the observation noise variance is proportional to the number of observations. The proof generalizes previous ones, which established this result only in dimension one or two, or for a restricted class of covariance kernels (degenerate ones).

A first limitation of the presented results is that, in practice, the noise variance generally does not depend on the number of observations. The proportional dependence of the noise variance on the number of observations is a technical assumption which allows for controlling the convergence of the learning curve. However, it is natural in the framework of experiments with replications or Monte-Carlo simulators. Deriving the presented results for the case of constant noise is still an open problem and is of great practical interest.

The asymptotic value of the MSE is derived in terms of the eigenvalues and eigenfunctions of the covariance function, and it holds for degenerate and non-degenerate kernels and for any dimension. From this theorem, we can deduce the asymptotic behavior of the generalization error—defined in this paper as the IMSE—as a function of the reduced observation noise variance (the noise variance when the number of observations equals one). A strength of this theorem is that the rate of convergence of the generalization error can be deduced from that of the eigenvalues, which is known for the usual covariance kernels. The relevance of this rate of convergence is demonstrated in a numerical study for different kernels. However, this leads to another limitation, since the presented results cannot be used for general covariance kernels for which the eigenvalue decay rate is unknown.

The significant differences between the rates of convergence of degenerate and non-degenerate kernels highlight the importance of proving this result for non-degenerate kernels. This is especially important as the usual kernels for Gaussian process regression are non-degenerate.

Finally, from a practical perspective, the presented method allows for evaluating the computational budget required to reach a given accuracy. It has been successfully applied to a real-world problem about the safety assessment of a nuclear system. However, it is efficient for specific applications (e.g. stochastic simulators with a constant observation noise variance) and when the computational budget is large. More investigations have to be performed to deal with the cases of heterogeneous noise, noise-free simulators, or a very limited computational budget.

Footnotes

  1.

    If \(B\) is a non-singular \(p\times p\) matrix, \(C\) a non-singular \(m \times m\) matrix and \(A\) a \(m \times p\) matrix with \(m,p < \infty \), then \((B+A^TC^{-1}A)^{-1} = B^{-1} - B^{-1}A^T(AB^{-1}A^T+C)^{-1}AB^{-1}\).


Acknowledgments

The authors are grateful to Dr. Yann Richet of the IRSN—Institute for Radiological Protection and Nuclear Safety—for providing the data for the industrial case through the reDICE project.

References

  1. Abramowitz, M., & Stegun, I. A. (1965). Handbook of mathematical functions. New York: Dover.
  2. Berger, J. O., De Oliveira, V., & Sansó, B. (2001). Objective Bayesian analysis of spatially correlated data. Journal of the American Statistical Association, 96, 1361–1374.
  3. Bozzini, M., & Rossini, M. (2003). Numerical differentiation of 2D functions from noisy data. Computers & Mathematics with Applications, 45, 309–327.
  4. Bronski, J. C. (2003). Asymptotics of Karhunen–Loève eigenvalues and tight constants for probability distributions of passive scalar transport. Communications in Mathematical Physics, 238, 563–582.
  5. Fang, K. T., Li, R., & Sudjianto, A. (2006). Design and modeling for computer experiments. Computer science and data analysis series. London: Chapman & Hall.
  6. Fernex, F., Heulers, L., Jacquet, O., Miss, J., & Richet, Y. (2005). The MORET 4B Monte Carlo code: New features to treat complex criticality systems. In M&C International Conference on Mathematics and Computation, Supercomputing, Reactor Physics and Nuclear and Biological Applications, Avignon, France.
  7. Gneiting, T., Kleiber, W., & Schlather, M. (2010). Matérn cross-covariance functions for multivariate random fields. Journal of the American Statistical Association, 105, 1167–1177.
  8. Harville, D. A. (1997). Matrix algebra from a statistician’s perspective. New York: Springer-Verlag.
  9. Laslett, G. M. (1994). Kriging and splines: An empirical comparison of their predictive performance in some applications. Journal of the American Statistical Association, 89, 391–400.
  10. Mercer, J. (1909). Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society A, 209, 441–458.
  11. Nazarov, A. I., & Nikitin, Y. Y. (2004). Exact \(L_2\)-small ball behaviour of integrated Gaussian processes and spectral asymptotics of boundary value problems. Probability Theory and Related Fields, 129, 469–494.
  12. Opper, M., & Vivarelli, F. (1999). General bounds on Bayes errors for regression with Gaussian processes. Advances in Neural Information Processing Systems, 11, 302–308.
  13. Picheny, V. (2009). Improving accuracy and compensating for uncertainty in surrogate modeling. PhD thesis, École Nationale Supérieure des Mines de Saint-Étienne.
  14. Pusev, R. S. (2011). Small deviation asymptotics for Matérn processes and fields under weighted quadratic norm. Theory of Probability and its Applications, 55, 164–172.
  15. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. Cambridge: MIT Press.
  16. Ritter, K. (2000a). Almost optimal differentiation using noisy data. Journal of Approximation Theory, 86, 293–309.
  17. Ritter, K. (2000b). Average-case analysis of numerical problems. Berlin: Springer-Verlag.
  18. Sacks, J., & Ylvisaker, D. (1981). Variance estimation for approximately linear models. Series Statistics, 12, 147–162.
  19. Sacks, J., Welch, W. J., Mitchell, T. J., & Wynn, H. P. (1989). Design and analysis of computer experiments. Statistical Science, 4, 409–423.
  20. Seeger, M. W., Kakade, S. M., & Foster, D. P. (2008). Information consistency of nonparametric Gaussian process methods. IEEE Transactions on Information Theory, 54(5), 2376–2382.
  21. Shanno, D. F. (1970). Conditioning of quasi-Newton methods for function minimization. Mathematics of Computation, 24, 647–656.
  22. Sollich, P., & Halees, A. (2002). Learning curves for Gaussian process regression: Approximations and bounds. Neural Computation, 14, 1393–1428.
  23. Stein, M. L. (1999). Interpolation of spatial data. Springer series in statistics. New York: Springer.
  24. van der Vaart, A., & van Zanten, H. (2011). Information rates of nonparametric Gaussian process methods. Journal of Machine Learning Research, 12, 2095–2119.
  25. Wackernagel, H. (2003). Multivariate geostatistics. Berlin: Springer-Verlag.
  26. Williams, C. K. I., & Vivarelli, F. (2000). Upper and lower bounds on the learning curve for Gaussian processes. Machine Learning, 40, 77–102.

Copyright information

© The Author(s) 2014

Authors and Affiliations

  1. Université Paris Diderot, Paris Cedex 13, France
  2. CEA, DAM, DIF, Arpajon, France
  3. Laboratoire de Probabilités et Modèles Aléatoires & Laboratoire Jacques-Louis Lions, Université Paris Diderot, Paris Cedex 13, France